Abstract. Published exactly seventy years ago, Jeffreys's Theory of
Probability (1939) has had a unique impact on the Bayesian commu- 
nity and is now considered to be one of the main classics in Bayesian 
Statistics as well as the initiator of the objective Bayes school. In par- 
ticular, its advances on the derivation of noninformative priors as well 
as on the scaling of Bayes factors have had a lasting impact on the field. 
However, the book reflects the characteristics of the time, especially in 
terms of mathematical rigor. In this paper we point out the fundamen- 
tal aspects of this reference work, especially the thorough coverage of 
testing problems and the construction of both estimation and testing 
noninformative priors based on functional divergences. Our major aim 
here is to help modern readers navigate this difficult text and
concentrate on passages that are still relevant today.

1. INTRODUCTION

The theory of probability makes it possible to 
respect the great men on whose shoulders we stand. 
H. Jeffreys, Theory of Probability, Section 1.6. 

Few Bayesian books other than Theory of Probability
are so often cited as a foundational text.^1 This
book is rightly considered as the principal reference 
in modern Bayesian statistics. Among other innova- 
tions, Theory of Probability states the general princi- 
ple for deriving noninformative priors from the sam- 
pling distribution, using Fisher information. It also 



This is an electronic reprint of the original article 
published by the Institute of Mathematical Statistics in
Statistical Science, 2009, Vol. 24, No. 2, 141-172. This 
reprint differs from the original in pagination and 
typographic detail. 



^1 Among the "Bayesian classics," only Savage (1954), DeGroot
(1970) and Berger (1985) seem to get more citations than
Jeffreys (1939, 1948, 1961), the more recent book by Bernardo
and Smith (1994) coming fairly close. The homonymous Theory
of Probability by de Finetti (1974, 1975) gets quoted a
third as much (Source: Google Scholar).






C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU 



proposes a clear account of Bayesian testing, including
the dimension-free scaling of Bayes factors.
This comprehensive treatment of Bayesian inference 
from an objective Bayes perspective is a major inno- 
vation for the time, and it has certainly contributed 
to the advance of a field that was then subjected to
severe criticism by R. A. Fisher (Aldrich, 2008) and
others, and was in danger of becoming a thing of
the past. As pointed out by Zellner (1980) in his
introduction to a volume of essays in honor of Harold
Jeffreys, a fundamental strength of Theory of Prob- 
ability is its affirmation of a unitarian principle in 
the statistical processing of all fields of science. 

For a 21st century reader, Jeffreys's Theory of 
Probability is nonetheless puzzling for its lack of 
formalism, including its difficulties in handling im- 
proper priors, its reliance on intuition, its long de- 
bate about the nature of probability, and its re- 
peated attempts at philosophical justifications. The 
title itself is misleading in that there is absolutely 
no exposition of the mathematical bases of probabil- 
ity theory in the sense of Billingsley (1986) or Feller 
(1970): "Theory of Inverse Probability" would have 
been more accurate. In other words, to a modern reader
the style of the book appears both verbose and often
vague in its mathematical foundations.^2
(Good, 1980, also acknowledges that many passages
of the book are "obscure.") It is thus difficult to
extract from this dense text the principles that made
Theory of Probability the reference it is nowadays. 
In this paper we endeavor to revisit the book from 
a Bayesian perspective, in order to separate founda- 
tional principles from less relevant parts. 

This review is neither a historical nor a critical
exercise: while conscious that Theory of Probability
reflects the idiosyncrasies both of the scientific
achievements of the 1930's (with, in particular, the
emerging formalization of Probability as a branch of
Mathematics against the ongoing debate on the nature
of probabilities) and of Jeffreys's background (as a
geophysicist), we aim rather at providing the
modern reader with a reading guide, focusing on the 
pioneering advances made by this book. Parts that 
correspond to the lack (at the time) of analytical 
(like matrix algebra) or numerical (like simulation) 
tools and their substitution by approximation de- 
vices (that are not used any longer, even though 



^2 In order to keep readability as high as possible, we shall
use modern notation whenever the original notation is either 
unclear or inconsistent, for example, Greek letters for param- 
eters and roman letters for observations. 



they may be surprisingly accurate), and parts that 
are linked with non-Bayesian perspectives will be covered
fleetingly. Thus, when pointing out notions that may 
seem outdated or even mathematically unsound by 
modern standards, our only aim is to help the mod- 
ern reader stroll past them, and we apologize in ad- 
vance if, despite our intent, our tone seems overly 
presumptuous: it is rather a reflection of our ignorance
of the conditions prevailing at the time since (to
borrow from the above quote, which may itself sound
somewhat presumptuous) we stand respectfully
at the feet of this giant of Bayesian Statistics.

The plan of the paper follows Theory of Probabil- 
ity linearly by allocating a section to each chapter of 
the book (Appendices are only mentioned through- 
out the paper). Section 10 contains a brief conclu- 
sion. Note that, in the following, words, sentences 
or passages quoted from Theory of Probability are 
written in italics with no precise indication of their 
location, in order to keep the style as light as pos- 
sible. We also stress that our review is based on 
the third edition of Theory of Probability (Jeffreys, 
1961), since this is both the most mature and the
most available version (through the last reprint by 
Oxford University Press in 1998). Contemporary re- 
views of Theory of Probability are found in Good 
(1962) and Lindley (1962). 

2. CHAPTER I: FUNDAMENTAL NOTIONS 

The posterior probabilities of the hypotheses are 
proportional to the products of the prior 
probabilities and the likelihoods. 
H. Jeffreys, Theory of Probability, Section 1.2. 

The first chapter of Theory of Probability sets gen- 
eral goals for a coherent theory of induction. More 
importantly, it proposes an axiomatic (if slightly 
tautological) derivation of prior distributions, while 
justifying this approach as coherent, compatible with 
the ordinary process of learning and allowing for the 
incorporation of imprecise information. It also rec- 
ognizes the fundamental property of coherence when 
updating posterior distributions, since they can be 
used as the prior probability in taking account
of a further set of data. Despite a style that is often 
difficult to penetrate, this is thus a major chapter of 
Theory of Probability. It will also become clearer at a 
later stage that the principles exposed in this chap- 
ter correspond to the (modern) notion of objective 
Bayes inference: despite mentions of prior probabil- 
ities as reflections of prior belief or existing pieces of 
information, Theory of Probability remains strictly
"objective" in that prior distributions are always de- 
rived analytically from sampling distributions and 
that all examples are treated in a noninformative 
manner. One may find it surprising that a physi- 
cist like Jeffreys does not emphasise the appeal of 
subjective Bayes, that is, the ability to take into ac- 
count genuine prior information in a principled way. 
But this is in line both with his predecessors, including
Laplace and Bayes, and their use of uniform
priors, and with his main field of study, which he perceived
as objective (Lindley, 2008, private communication),
while one of the main appeals of Theory of Probabil- 
ity is to provide a general and coherent framework 
to derive objective priors. 

2.1 A Philosophical Exercise 

The chapter starts in Section 1.0 with an epis- 
temological discussion of the nature of (statistical) 
inference. Some sections are quite puzzling. For in- 
stance, the example that the kinematic equation for 
an object in free-fall, 

\[ s = a + ut + \tfrac{1}{2} g t^2, \]

cannot be deduced from observations is used as an 
argument against deduction under the reasoning that 
an infinite number of functions, 

\[ s = a + ut + \tfrac{1}{2} g t^2 + f(t)(t - t_1) \cdots (t - t_n), \]

also apply to describe a free fall observed at times
$t_1, \ldots, t_n$. The limits of the epistemological discussion
in those early pages are illustrated by the introduction
of Ockham's razor (the choice of the simplest
law that fits the fact): the meaning of what a simplest
law can be remains unclear, and the section lacks a
clear (objective) argument motivating this choice,
besides common sense. The discussion ends with the
somewhat paradoxical statement that, since deductive
logic provides no explanation of the choice of the
simplest law, this is proof that deductive logic is
grossly inadequate to cover scientific and practical
requirements. On the other hand,
and from a statistician's narrower perspective, one 
can re-interpret this gravity example as possibly the 
earliest discussion of the conceptual difficulties asso- 
ciated with model choice, which are still not entirely 
resolved today. In that respect, it is quite fascinat- 
ing to see this discussion appear so early in the book 
(third page), as if Jeffreys had perceived how impor- 
tant this debate would become later. 



Note that, maybe due to this very call to Ockham, 
the later Bayesian literature abounds in references 
to Ockham's razor with little formalization of this 
principle, even though Berger and Jefferys (1992), 
Balasubramanian (1997) and MacKay (2002) develop 
elaborate approaches. In particular, the definition 
of the Bayes factor in Section 1.6 can be seen as a 
partial implementation of Ockham's razor when set- 
ting the probabilities of both models equal to 1/2. 
At the beginning of his Chapter 28, entitled Model
Choice and Occam's Razor, MacKay (2002) argues 
that Bayesian inference embodies Ockham's razor 
because "simple" models tend to produce more pre- 
cise predictions and, thus, when the data is equally 
compatible with several models, the simplest one 
will end up as the most probable. This is generally 
true, even though there are some counterexamples 
in Bayesian nonparametrics. 
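
As a minimal sketch of the argument just attributed to MacKay (2002), the following Python fragment compares two models for a sequence of coin tosses. The choice of models is our own illustrative assumption, not an example taken from Theory of Probability: a "simple" model fixes the head probability at 1/2, while a "complex" model spreads a uniform prior over [0, 1]. Both models receive prior probability 1/2, as in the default setting mentioned above.

```python
from math import comb


def marginal_fair(n, k):
    """Probability of one specific sequence of n tosses with k heads when p = 1/2."""
    return 0.5 ** n


def marginal_uniform(n, k):
    """Probability of the same sequence averaged over p ~ Uniform(0, 1).

    The integral of p^k (1 - p)^(n - k) over [0, 1] is the Beta function
    B(k + 1, n - k + 1), which equals 1 / ((n + 1) * C(n, k)).
    """
    return 1.0 / ((n + 1) * comb(n, k))


def posterior_prob_simple(n, k, prior_simple=0.5):
    """Posterior probability of the simple model (assumed prior weight 1/2)."""
    m0 = marginal_fair(n, k)
    m1 = marginal_uniform(n, k)
    return prior_simple * m0 / (prior_simple * m0 + (1 - prior_simple) * m1)


# Data compatible with both models: the simple model ends up more probable.
balanced = posterior_prob_simple(20, 10)
# Data that only the flexible model can accommodate: the simple model is rejected.
lopsided = posterior_prob_simple(20, 18)
print(balanced, lopsided)
```

With 10 heads out of 20 the simple model gets the larger posterior probability, because the uniform model wastes prior mass on values of p that predict such data poorly; with 18 heads out of 20 the ordering is reversed.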

Overall, we nonetheless feel that this part of The- 
ory of Probability could be skipped at first read- 
ing as less relevant for Bayesian studies. In particu- 
lar, the opposition between mathematical deduction 
and statistical induction does not appear to carry a 
strong argument, even though the distinction needs 
(needed?) to be made for mathematically oriented 
readers unfamiliar with statistics. However, from a 
historical point of view, this opposition must be con- 
sidered against the then-ongoing debate about the 
nature of induction, as illustrated, for instance, by 
Karl Popper's articles of this period about the logi- 
cal impossibility of induction (Popper, 1934). 

2.2 Foundational Principles 

The text becomes more focused when dealing with 
the construction of a theory of inference: while some 
notions are yet to be defined, including the pervasive 
evidence, sentences like inference involves in its very 
nature the possibility that the alternative chosen as 
the most likely may in fact be wrong are in line with 
our current interpretation of modeling and obviously 
with the Bayesian paradigm. In Section 1.1 Jeffreys 
sets up a collection of postulates or rules that act 
like axioms for his theory of inference, some of which 
require later explanations to be fully understood: 

1. All hypotheses must be explicitly stated and the 
conclusions must follow from the hypotheses: what 
may first sound like an obvious scientific principle 
is in fact a leading characteristic of Bayesian statis- 
tics. While it seems to open a whole range of new 
questions ("To what extent must we define our belief
in the statistical models used to build our inference?
How can a unique conclusion stem from a
given model and a given set of observations?") and
while it may sound far too generic to be useful, we 
may interpret this statement as setting the working 
principle of Bayesian decision theory: given a prior, 
a sampling distribution, an observation and a loss 
function, there exists a single decision procedure. 
In contrast, the frequentist theories of Neyman or 
of Fisher require the choice of ad hoc procedures, 
whose (good or bad) properties they later analyze. 
But this may be a far-fetched interpretation of this 
rule at this stage even though the comment will ap- 
pear more clearly later. 

2. The theory must be self-consistent. The statement
is to some extent a repetition of the previous rule
and it is only later (in Section 3.10) that its mean- 
ing becomes clearer, in connection with the intro- 
duction of Jeffreys's noninformative priors as a self- 
contained principle. Consistency is nonetheless a dom- 
inant feature of the book, as illustrated in Section 3.1 
with the rejection of Haldane's prior.^3

3. Any rule must be applicable in practice. This 
"rule" does not seem to carry any weight in prac- 
tice. In addition, the explicit prohibition of esti- 
mates based on impossible experiments sounds im- 
plementable only through deductive arguments. But 
this leads to the exclusion of rules based on fre- 
quency arguments and, as such, is fundamental in 
setting a Bayesian framework. Alternatively (and 
this is another interpretation), this constraint should
be worded in more formal terms of the measurability 
of procedures. 

4. The theory must provide explicitly for the pos- 
sibility that inferences made by it may turn out to 
be wrong. This is both a fundamental aspect of sta- 
tistical inference and an indication of a surprising 
view of inference. Indeed, even when conditioning 
on the model, inference is never right in the sense 
that a point estimate rarely gives the true answer. 
It may be that Jeffreys is solely thinking of sta- 
tistical testing, in which case the rightfulness of a 
decision is necessarily conditional on the truthful- 
ness of the corresponding model and thus dubious. 
A more relative (or more precise) statement would 



^3 Consistency is then to be understood in the weak sense of
invariant under reparameterization, which is a usual argument 
for Jeffreys's principle, not in terms of asymptotic convergence 
properties. 



have been more adequate. But, from reading fur- 
ther (as in Section 1.2), it appears that this rule is 
to be understood as the foundational principle (the
chief constructive rule) for defining prior distribu- 
tions. While this is certainly not clear at this stage, 
Bayesian inference does indeed provide for the pos- 
sibility that the model under study is not correct 
and for the unreliability of the resulting inference 
via a posterior probability. 

5. The theory must not deny any empirical propo- 
sition a priori. This principle remains unclear when 
put into practice. If it is to be understood in the 
sense of a physical theory, there is no reason why 
some empirical proposition could not be excluded 
from the start. If it is in the sense of an inferential
theory, then the statement would require a better
definition of empirical proposition. But Jeffreys's use
of the epithet a priori seems to imply that the prior
distribution corresponding to the theory must be as 
inclusive as possible. This certainly makes sense as 
long as prior information does not exclude parts of 
the parameter space as, for instance, in Physics. 

6. The number of postulates should be reduced to a 
minimum. This rule sounds like an embedded Ock- 
ham's razor, but, more positively, it can also be in- 
terpreted as a call for noninformative priors. Once 
again, the vagueness of the wording opens a wide 
range of interpretations. 

7. The theory need not represent thought-processes 
in details, but should agree with them in outline. 
This vague principle could be an attempt at
reconciling statistical theories, but it does not give
clear directions on how to proceed. In the light of 
Jeffreys's arguments, it could rather signify that the 
construction of prior distributions cannot exactly re- 
flect an actual construction in real life. Since a non- 
informative (or "objective") perspective is adopted 
for most of the book, this is more likely to be a pre- 
liminary argument in favor of this line of thought. In 
Section 1.2 this rule is invoked to derive the (prior) 
ordering of events. 

8. An objection carries no weight if [it] would in- 
validate part of pure mathematics. This rule grounds 
Theory of Probability within mathematics, which may 
be a necessary reminder in the spirit of the time 
(where some were attempting to dissociate statis- 
tics from mathematics). 

The next paragraph discusses the notion of prob- 
ability. Its interest is mostly historical: in the early 
1930's, the axiomatic definition of probability based 
on Kolmogorov's axioms was not yet universally
accepted, and there were still attempts to base this
definition on limiting properties. In particular, 
Lebesgue integration was not part of the undergrad- 
uate curriculum till the late 1950's at either Cam- 
bridge or Oxford (Lindley, 2008, private communi- 
cation). This debate is no longer relevant, and the 
current theory of probability, as derived from mea- 
sure theory, does not bear further discussion. This 
also removes the ambiguity of constructing objective 
probabilities as derived from actual or possible ob- 
servations. A probability model is to be understood 
as a mathematical (and thus unobjectionable) con- 
struct, in agreement with Rule 8 above. 

Then follows (still in Section 1.1) a rather long 
debate on causality versus determinism. While the 
principles stated in those pages are quite accept- 
able, the discussion only uses the most basic concept 
of determinism, namely, that identical causes give 
identical effects, in the sense of Laplace. We thus 
agree with Jeffreys that, at this level, the principle 
is useless, but the same paragraph actually leaves 
us quite confused as to its real purpose. A likely ex- 
planation (Lindley, 2008, personal communication) 
is that Jeffreys stresses the inevitability of probabil- 
ity statements in Science: (measurement) errors are 
not mistakes but part of the picture. 

2.3 Prior Distributions 

In Section 1.2 Jeffreys introduces the notion of 
prior in an indirect way, by considering that the 
probability of a proposition is always conditional on 
some data and that the occurrence of new items of 
information (new evidence) on this proposition simply
updates the available data. This is slightly contrary
to our current way of defining a prior distribution
$\pi$ on a parameter $\theta$ as the information available
on $\theta$ prior to the observation of the data, but
it simply conveys the fact that the prior distribution
must be derived from some prior items of information
about $\theta$. As pointed out by Jeffreys, this
also allows for the coexistence of prior distributions
for different experts within the same probabilistic
framework.^4 In the sequel all statements will, however,
condition on the same data.

The following paragraphs derive standard math- 
ematical logic axioms that directly follow from a 



^4 Jeffreys seems to further note that the same conditioning
applies for the model of reference. 



formal (modern) definition of a probability distri- 
bution, with the provision that this probability is 
always conditional on the same data. This is also 
reminiscent of the derivation of the existence of a 
prior distribution from an ordering of prior probabilities
in DeGroot (1970), but the discussion about
the arbitrary ranking of probabilities between 0 and
1 may sound anecdotal today. Note also that, from
a mathematical point of view, defining only conditional
probabilities like $P(p|q)$ is somewhat superfluous
in that, if the conditioning $q$ is to remain fixed,
$P(\cdot|q)$ is a regular probability distribution, while, if
$q$ is to be updated into $qr$, $P(\cdot|qr)$ can be derived
from $P(\cdot|q)$ by Bayes' theorem (which is to be introduced
later). Therefore, in all cases, $P(\cdot|q)$ appears
like the reference probability. At some stage,
while stating that the probability of the sure event
is equal to one is merely a convention, Jeffreys indicates
that, when expressing ignorance over an infinite
range of values of a quantity, it may be convenient
to use $\infty$ instead. Clearly, this paves the way
for the introduction of improper priors.^5 Unfortunately,
the convention and the motivation (to keep
ratios for finite ranges determinate) do not seem
correct, if in tune with the perspective of the time
(see, e.g., Lhoste, 1923; Broemeling and Broemeling,
2003). Notably, setting all events involving an
infinite range with a probability equal to $\infty$ seems
to restrict the abilities of the theory to a large
extent.^6 Similar to Laplace, Jeffreys is more used to
handling equal probability finite sets than continuous
sets and the extension to continuous settings is
unorthodox, using, for instance, Dedekind's sections
and putting several meanings under the notation $dx$.
Given the convoluted derivation of conditional probabilities
in this context, the book states the product
rule $P(qr|p) = P(q|p)P(r|qp)$ as an axiom, rather
than as a consequence of the basic probability axioms.
It leads (in Section 1.22) to Bayes' theorem,



^5 Jeffreys's Theory of Probability strongly differs from the
earlier Scientific Inference (1931) in this respect, the latter be- 
ing rather dismissive of the mathematical difficulty: To make 
this integral equal to 1 we should therefore have to include 
a zero factor unless very small and very large values are ex- 
cluded. This does appear to be the case (Section 5.43, page 
67). 

^6 This difficulty with handling $\sigma$-finite measures and continuous
variables will be recurrent throughout the book; Jeffreys
does not seem to be averse to normalizing an improper
distribution by $\infty$, even though the corresponding derivations
are not meaningful.
namely, that, for all events $q_r$,

\[ P(q_r|pH) \propto P(q_r|H) P(p|q_r H), \]

where $H$ denotes the information available and $p$ a
set of observations. In this (modern) format $P(p|q_r H)$
is identified as Fisher's likelihood and $P(q_r|H)$ as the
prior probability. Bayes' theorem is defined as the
principle of inverse probability and only for finite
sets, rather than for measures.^7 Obviously, the general
version of Bayes' theorem is used in the sequel
for continuous parameter spaces.
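
For a finite set of hypotheses, the principle of inverse probability reduces to a few lines of arithmetic. The sketch below is only an illustration under assumptions of our own (three hypothetical values for the head probability of a coin and a uniform prior over them); it also checks the coherence property stressed in Chapter I, namely that using a posterior as the prior for further data gives the same answer as processing all the data at once.

```python
def posterior(prior, likelihood):
    """Posterior over a finite set of hypotheses: prior times likelihood, normalized."""
    unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical finite set of hypotheses: candidate head probabilities of a coin.
heads_prob = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]


def likelihood(n_heads, n_tails):
    """Likelihood of a given sequence with n_heads heads and n_tails tails."""
    return [h ** n_heads * (1 - h) ** n_tails for h in heads_prob]


# All data at once: three heads and one tail.
batch = posterior(prior, likelihood(3, 1))

# Same data in two steps, the first posterior serving as the second prior.
step_one = posterior(prior, likelihood(2, 0))
sequential = posterior(step_one, likelihood(1, 1))

print(batch)
print(sequential)
```

Both routes give the same posterior up to floating-point rounding, since the normalizing constants cancel and only the product of likelihoods matters.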

Section 1.3 represents one of the few forays of 
the book into the realm of decision theory,^8 in connection
with Laplace's notions of mathematical and
moral expectations, and with Bernoulli's Saint Pe- 
tersburg paradox, but there is no recognition of the 
central role of the loss function in defining an opti- 
mal Bayes rule as formalized later by Wald (1950) 
and Raiffa and Schlaifer (1961). The attribution of 
a decision-theoretic background to T. Bayes himself 
is surprising, since there is not anything close to the 
notion of loss or of benefit in Bayes' (1763) origi- 
nal paper. We nonetheless find there the seed of an 
idea later developed in Rubin (1987), among oth- 
ers, that prior and loss function are indistinguish- 
able. [Section 1.8 briefly re-enters this perspective 
to point out that (posterior) expectations are often 
nowhere near the actual value of the random quan- 
tity.] The next section (Section 1.4) is important in 
that it tackles for the first time the issue of noninformative
priors. When the number of alternatives
is finite, Jeffreys picks the uniform prior as his non- 
informative prior, following Laplace's Principle of 
Insufficient Reason. The difficulties associated with 
this choice in continuous settings are not mentioned 
at this stage. 

2.4 More Axiomatics and Some Asymptotics 

Section 1.5 attempts an axiomatic derivation that 
the Bayesian principles just stated follow the rules 



^7 As noted by Fienberg (2006), the adjective
"Bayesian" had not yet appeared in the statistical literature 
by the time Theory of Probability was published, and Jeffreys 
sticks to the 19th century denomination of "inverse probabil- 
ity." The adjective can be traced back to either Ronald Fisher, 
who used it in a rather derogatory meaning, or to Abraham 
Wald, who gave it a more complimentary meaning in Wald 
(1950). 

^8 The reference point estimator advocated by Jeffreys (if
any) seems to be the maximum a posteriori (MAP) estimator, 
even though he stated in his discussion of Lindley (1953) that 
he deprecated the whole idea of picking out a unique estimate. 



imposed earlier. This part does not bring much nov- 
elty, once the fundamental properties of a proba- 
bility distribution are stated. This is basically the 
purpose of this section, where earlier "Axioms" are 
checked in terms of the posterior probability $P(\cdot|pH)$.
A reassuring consequence of this derivation is that 
the use of a posterior probability as the basis for 
inference cannot lead to inconsistency. The use of 
the posterior as a new prior for future observations 
and the corresponding learning principle are devel- 
oped at this stage. The debate about the choice of 
the prior distribution is postponed till later, while 
the issue of the influence of this prior distribution 
is dismissed as having very little difference [on] the 
results, which needs to be quantified, as in the quote 
below at the beginning of Section 5. 

Given the informal approach to (or rather with- 
out) measure theory adopted in Theory of Probabil- 
ity, the study of the limiting behavior of posterior 
distributions in Section 1.6 does not provide much 
insight. For instance, the fact that the identity (valid
for consequences $p_1, \ldots, p_n$ of a hypothesis $q$)

\[ P(q|p_1 \cdots p_n H) = \frac{P(q|H)}{P(p_1|H) P(p_2|p_1 H) \cdots P(p_n|p_1 \cdots p_{n-1} H)} \]

is shown to imply that $P(p_n|p_1 \cdots p_{n-1} H)$ converges
to 1 is not particularly surprising, although it relates
to Laplace's principle that repeated verifications of
consequences of a hypothesis will make it practically
certain that the next consequence will be verified. It
would have been equally interesting to focus on cases
in which $P(q|p_1 \cdots p_n H)$ goes to 1.
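
The two limits discussed in this paragraph can be followed numerically in a toy setting. The sketch below rests on assumptions of our own, not on an example from the book: the hypothesis q has prior probability 1/100 and entails every consequence p_i, while under the alternative each p_i is verified independently with a probability drawn from a uniform distribution on [0, 1], so that n successive verifications have probability 1/(n + 1).

```python
from fractions import Fraction


def prob_all_verified(n, prior_q):
    """P(p_1 ... p_n | H) under the assumed two-component prior.

    With probability prior_q the hypothesis q holds and every p_i is verified;
    otherwise n verifications in a row have probability 1 / (n + 1).
    """
    return prior_q + (1 - prior_q) * Fraction(1, n + 1)


def predictive(n, prior_q):
    """P(p_n | p_1 ... p_{n-1} H), a ratio of two joint probabilities."""
    return prob_all_verified(n, prior_q) / prob_all_verified(n - 1, prior_q)


def posterior_q(n, prior_q):
    """P(q | p_1 ... p_n H) = P(q | H) / P(p_1 ... p_n | H)."""
    return prior_q / prob_all_verified(n, prior_q)


prior_q = Fraction(1, 100)
for n in (1, 10, 100, 1000):
    print(n, float(predictive(n, prior_q)), float(posterior_q(n, prior_q)))
```

In this toy setting the predictive probability of the next verification approaches 1 much faster than the posterior probability of q itself, which illustrates why the second limit deserves separate attention.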

The end of Section 1.62 introduces some quanti- 
ties of interest, such as the distinction between esti- 
mation problems and significance tests, but with no 
clear guideline: when comparing models of complexity
$m$ (this quantity being only defined for differential
equations), Jeffreys suggests using prior probabilities
that are penalized by $m$, such as $2^{-m}$ or
$6/(\pi^2 m^2)$, the motivation for those specific values being
that the corresponding series converge. Penalization
by the model complexity is quite an interesting
idea, to be formalized later by, for example,
Rissanen (1983, 1990), but Jeffreys somehow kills 
this idea before it is hatched by pointing out the 
difficulties with the definition of m. 
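
The convergence that motivates these two complexity penalties is easy to check numerically: both sequences sum to one over m = 1, 2, ..., so each defines a proper prior on the complexity index. The short sketch below only verifies this with partial sums; the truncation point is an arbitrary choice of ours.

```python
from math import pi


def geometric_prior(m):
    """Prior weight 2^(-m) for a model of complexity m = 1, 2, ..."""
    return 2.0 ** (-m)


def inverse_square_prior(m):
    """Prior weight 6 / (pi^2 m^2) for a model of complexity m = 1, 2, ..."""
    return 6.0 / (pi ** 2 * m ** 2)


M = 1000  # arbitrary truncation point for the partial sums
geometric_total = sum(geometric_prior(m) for m in range(1, M + 1))
inverse_square_total = sum(inverse_square_prior(m) for m in range(1, M + 1))
print(geometric_total, inverse_square_total)
```

The geometric weights reach their limit almost immediately, whereas the inverse-square weights leave a tail of order 1/M, so the second prior penalizes complex models far less severely.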

Instead, Jeffreys switches to a completely different 
(if paramount) topic by defining in a few lines the 
Bayes factor for testing a point null hypothesis, 

\[ \frac{P(q|\theta H)}{P(q'|\theta H)} \Big/ \frac{P(q|H)}{P(q'|H)}, \]

where $\theta$ denotes the data. He suggests using $P(q|H) =
1/2$ as a default value, except for sequences of embedded
hypotheses for which he suggests

\[ \frac{P(q|H)}{P(q'|H)} = 2, \]

presumably because the series with leading term $2^{-n}$
is converging.
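
To make the ratio of posterior odds to prior odds concrete, here is a hedged sketch for a point null hypothesis on a normal mean. The setting is an assumption of ours chosen for its closed form (known variance and a normal prior with variance tau2 under the alternative q'); it is not the prior that Jeffreys himself recommends for this problem.

```python
from math import exp, pi, sqrt


def normal_pdf(x, variance):
    """Density at x of a centered normal distribution with the given variance."""
    return exp(-x * x / (2 * variance)) / sqrt(2 * pi * variance)


def bayes_factor_point_null(xbar, n, sigma2=1.0, tau2=1.0):
    """Bayes factor of q (mean equal to 0) against q' (mean ~ N(0, tau2)).

    xbar is the average of n observations with known variance sigma2; its
    marginal distribution is N(0, sigma2 / n) under q and
    N(0, sigma2 / n + tau2) under q'.
    """
    return normal_pdf(xbar, sigma2 / n) / normal_pdf(xbar, sigma2 / n + tau2)


def posterior_prob_null(bayes_factor, prior_null=0.5):
    """Posterior probability of q: posterior odds = Bayes factor * prior odds."""
    posterior_odds = bayes_factor * prior_null / (1 - prior_null)
    return posterior_odds / (1 + posterior_odds)


for xbar in (0.0, 0.2, 0.4, 0.6):
    k = bayes_factor_point_null(xbar, n=25)
    print(xbar, k, posterior_prob_null(k))
```

With the default prior probability of 1/2 the prior odds equal one, so the posterior odds coincide with the Bayes factor.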

Once again, the rather quick coverage of this material
is somewhat frustrating, as further justifications
would have been necessary for the choice of
the constant and so on.^9 Instead, the chapter
concludes with a discussion of the distinction between
"idealism" and "realism" that can be skipped for 
most purposes. 

3. CHAPTER II: DIRECT PROBABILITIES 

The whole of the information contained in the 
observations that is relevant to the posterior 
probabilities of different hypotheses is summed 
up in the values that they give to the likelihood. 
H. Jeffreys, Theory of Probability, Section 2.0. 

This chapter is certainly the least "Bayesian" chap- 
ter of the book, since it covers both the standard 
sampling distributions and some equally standard 
probability results. It starts with a reminder that 
the principle of inverse probability can be stated in 
the form 

Posterior Probability $\propto$ Prior Probability

$\cdot$ Likelihood,

thus rephrasing Bayes' theorem in terms of the like- 
lihood and with the proper indication that the rel- 
evant information contained in the observations is 
summarized by the likelihood (sufficiency will be
mentioned later in Section 3.7). Then follows (still 
in Section 2.0) a long paragraph about the tenta- 
tive nature of models, concluding that a statistical 
model must be made part of the prior information 
H before it can be tested against the observations, 
which (presumably) relates to the fact that Bayesian 



^9 Similarly, the argument against philosophers that maintain
that no method based on the theory of probability can give
a (...) non-zero probability to a precise value against a contin- 
uous background is not convincing as stated. The distinction 
between zero measure events and mixture priors including a 
Dirac mass should have been better explained, since this is 
the basis for Bayesian point-null testing. 



model assessment must involve a description of the 
alternative(s) to be validated. 

The main bulk of the chapter is about sampling 
distributions. Section 2.1 introduces binomial and 
hypergeometric distributions at length, including the 
interesting problem of deciding between binomial 
versus negative binomial experiments when faced 
with the outcome of a survey, used later in the de- 
fence of the Likelihood Principle (Berger and Wolpert, 
1988). The description of the binomial contains the 
equally interesting remark that a given coin repeat- 
edly thrown will show a bias toward head or tail due 
to the wear, a remark later exploited in Diaconis and 
Ylvisaker (1985) to justify the use of mixtures of 
conjugate priors. Bernoulli's version of the Central 
Limit theorem is also recalled in this section, with 
no particular appeal if one considers that a mod- 
ern Statistics course (see, e.g., Casella and Berger, 
2001) would first start with the probabilistic
background.^10
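
The binomial versus negative binomial question mentioned above lends itself to a short numerical check. In the sketch below, the counts (3 successes in 12 trials), the uniform prior and the grid approximation are illustrative assumptions of ours; the point is only that the two sampling schemes give proportional likelihoods and hence the same posterior.

```python
from math import comb


def grid_posterior(likelihood, grid):
    """Normalized posterior weights on a grid, under a uniform prior."""
    weights = [likelihood(t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]


successes, trials = 3, 12
grid = [(i + 0.5) / 200 for i in range(200)]


def binomial_likelihood(t):
    """The number of trials was fixed in advance at 12."""
    return comb(trials, successes) * t ** successes * (1 - t) ** (trials - successes)


def negative_binomial_likelihood(t):
    """Sampling continued until the 3rd success, which occurred at trial 12."""
    return comb(trials - 1, successes - 1) * t ** successes * (1 - t) ** (trials - successes)


post_binomial = grid_posterior(binomial_likelihood, grid)
post_negative_binomial = grid_posterior(negative_binomial_likelihood, grid)
max_difference = max(abs(a - b) for a, b in zip(post_binomial, post_negative_binomial))
print(max_difference)
```

The two likelihoods differ only by a constant factor that disappears on normalization, so the difference between the posteriors is at the level of floating-point rounding.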

The Poisson distribution is first introduced as a 
limiting distribution for the binomial distribution 
$B(n,p)$ when $n$ is large and $np$ is bounded. (Connections
with radioactive disintegration are mentioned
afterward.) The normal distribution is proposed as 
a large sample approximation to a sum of Bernoulli 
random variables. As for the other distributions, 
there is some attempt at justifying the use of the 
normal distribution, as well as [what we find to be] 
a confusing paragraph about the "true" and "actual 
observed" values of the parameters. A long section 
(Section 2.3) expands about the properties of Pear- 
son's distributions, then allowing Jeffreys to intro- 
duce the negative binomial as a mixture of Poisson 
distributions. The introduction of the bivariate nor- 
mal distribution is similarly convoluted, using first 
binomial variates and second a limiting argument, 
and without resorting to matrix formalism. 
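The Poisson limit of the binomial distribution recalled above is easy to check numerically. This is only a sketch under our own assumptions: the value np = 2 and the function names are hypothetical and do not come from the book.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

def largest_gap(n, lam, kmax=20):
    """Largest pointwise gap between B(n, lam/n) and P(lam) over k = 0..kmax."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(min(n, kmax) + 1))

# Hypothetical setting: np is held at 2 while n grows.
gaps = [largest_gap(n, 2.0) for n in (10, 100, 1000)]
print(gaps)  # the gaps shrink as n increases
```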

Section 2.6 attempts to introduce cumulative dis- 
tribution functions in a more formal manner, using 
the current three-step definition, but again dealing 
with limits in an informal way. Rather coherently 
from a geophysicist's point of view, characteristic 
functions are also covered in great detail, including 
connections with moments and the Cauchy distribution, as well as
Lévy's inversion theorem. The main goal of using characteristic
functions seems nonetheless to be able to establish the Central
Limit theorem in its full generality (Section 2.664).

^In fact, some of the statements in Theory of Probability that
surround the statement of the Central Limit theorem are not in
agreement with measure theory, as, for instance, the confusion
between pointwise and uniform convergence, and convergence in
probability and convergence in distribution.

Rather surprisingly for a Bayesian reference book and mostly in
complete disconnection with the testing chapters, the $\chi^2$ test
of goodness of fit is given a large and uncritical place within this
book, including an adjustment for the degrees of freedom. Examples
include the obvious independence of a rectangular contingency table.
The only criticism (Section 2.76) is fairly obscure in that it
blames poor performance of the test on the fact that all divergences
in the sum are equally weighted. The test is nonetheless implemented
in the most classical manner, namely, that the hypothesis is
rejected if the $\chi^2$ statistic is outside the standard interval.
It is unclear from the text in Section 2.76 that rejection would
occur were the $\chi^2$ statistic too small, even though Jeffreys
rightly addresses the issue at the end of Chapter 5 (Section 5.63).
He also mentions the need to coalesce small groups into groups of
size at least 5 with no further justification. The chapter concludes
with similar uses of Student's t and Fisher's z tests.

4. CHAPTER III: ESTIMATION PROBLEMS 

If we have no information relevant to the actual
value of the parameter, the probability must be 
chosen so as to express the fact that we have none. 
H. Jeffreys, Theory of Probability, Section 3.1. 

This is a major chapter of Theory of Probabil- 
ity as it introduces both exponential families and 
the principle of Jeffreys noninformative priors. The 
main concepts are already present in the early sec- 
tions, including some invariance principles. The pur- 
pose of the chapter is stated as a point estimation 
problem, where obtaining the probability distribu- 
tion of [the] parameters, given the observations is 
the goal. Note that estimation is not to be under- 
stood in the (modern?) sense of point estimation, 
that is, as a way to produce numerical substitutes 
for the true parameters that are based on the data,



Interestingly enough, the parameters are estimated by minimum
$\chi^2$ rather than either maximum likelihood or Bayesian point
estimates. This is, again, a reflection of the practice of the time,
coupled with the fact that most approaches are asymptotically
indistinguishable. Posterior expectations are not at all advocated
as Bayes (point) estimators in Theory of Probability.



since the decision-theoretic perspective for building 
(point) estimators is mostly missing from the book 
(see Section 1.8 for a very brief remark on expecta- 
tions). Both Good (1980) and Lindley (1980) stress 
this absence. 

4.1 Noninformative Priors of Former Days 

Section 3.1 sets the principles for selecting noninformative priors.
Jeffreys recalls Laplace's rule that, if a parameter is real-valued,
its prior probability should be taken as uniformly distributed,
while, if this parameter is positive, the prior probability of its
logarithm should be taken as uniformly distributed. The motivation
advanced for using both priors is the invariance principle, namely,
the invariance of the prior selection under several different sets
of parameters. At this stage, there is no recognition of a potential
problem with using a $\sigma$-finite measure and, in particular,
with the fact that these priors are not probability distributions,
but rather a simple warning that these are formal rules expressing
ignorance. We face the difficulty mentioned earlier when considering
$\sigma$-finite measures since they are not properly handled at this
stage: when stating that one starts with any distribution of prior
probability, it is not possible to include $\sigma$-finite measures
this way, except via the [incorrect] argument that a probability is
merely a number and, thus, that the total weight can be $\infty$ as
well as 1: use $\infty$ instead of 1 to indicate certainty on data
H. The wrong interpretation of a $\sigma$-finite measure as a
probability distribution (and of $\infty$ as a "number") then leads
to immediate paradoxes, such as the prior probability of any finite
range being null, which sounds inconsistent with the statement that
we know nothing about the parameter, but this results from an
over-interpretation of the measure as a probability distribution
already pointed out by Lindley (1971, 1980) and Kass and Wasserman
(1996).

The argument for using a flat (Lebesgue) prior is based (a) on its
use by both Bayes and Laplace in finite or compact settings, and
(b) on the argument that it correctly reflects the absence of prior
knowledge about the value of the parameter. At this stage, no point
is made against it for reasons related to the invariance principle
(there is only one parameterization that coincides with a uniform
prior) but Jeffreys already argues that flat priors cannot be used
for significance tests, because they would always reject the point
null hypothesis. Even though Bayesian significance tests, including
Bayes factors, have not yet been properly introduced, the notion of
an infinite mass canceling a point null hypothesis is sufficiently
intuitive to be used at this point.

While, indeed, using an improper prior is a major difficulty when
testing point null hypotheses because it gives an infinite mass to
the alternative (DeGroot, 1970), Jeffreys fails to identify the
problem as such but rather blames the flat prior applied to a
parameter with a semi-infinite range of possible values. He then
goes on justifying the use of $\pi(\sigma) = 1/\sigma$ for positive
parameters (replicating the argument of Lhoste, 1923) on the basis
that it is invariant for the change of parameters
$\varrho = 1/\sigma$, as well as any other power, failing to
recognize that other transforms that preserve positivity do not
exhibit such an invariance. One has to admit, however, that, from a
physicist's perspective, power transforms are more important than
other mathematical transforms, such as arctan, because they can be
assigned meaningful units of measurement, while other functions
cannot. At least this seems to be the spirit of the examples
considered in Theory of Probability: Some methods of measuring the
charge of an electron give e, others $e^2$.
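The restricted invariance of the 1/σ prior can be made concrete with a small numerical sketch. The intervals and the transform exp(σ) - 1 below are hypothetical choices of ours, used only to contrast power transforms with another positivity-preserving map; the check compares the relative masses of two intervals when the log-flat rule is applied before and after reparameterization.

```python
from math import exp, log

def log_flat_mass(a, b):
    """Mass given to the interval with endpoints a, b > 0 by d(sigma)/sigma."""
    return abs(log(b / a))

def mass_ratio(transform, interval_1, interval_2):
    """Ratio of the masses of two intervals when the 1/rho rule is applied to
    rho = transform(sigma) (transform assumed monotone and positive)."""
    (a1, b1), (a2, b2) = interval_1, interval_2
    return (log_flat_mass(transform(a1), transform(b1))
            / log_flat_mass(transform(a2), transform(b2)))

intervals = ((1.0, 2.0), (2.0, 8.0))          # hypothetical ranges for sigma
identity = mass_ratio(lambda s: s, *intervals)
cube = mass_ratio(lambda s: s**3, *intervals)
inverse = mass_ratio(lambda s: 1 / s, *intervals)
shifted_exp = mass_ratio(lambda s: exp(s) - 1, *intervals)
print(identity, cube, inverse, shifted_exp)
```

The first three ratios coincide, while the last one differs: the rule is coherent across powers of the parameter but not across arbitrary transforms.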

There is a vague indication that Jeffreys may also recognize
$\pi(\sigma) = 1/\sigma$ as the scale group invariant measure, but
this is unclear. An indefensible argument follows, namely, that

$$\int_0^a v^n\,dv \Big/ \int_a^\infty v^n\,dv$$

is only indeterminate when n = -1, which allows us to avoid
contradictions about the lack of prior information. Jeffreys
acknowledges that this does not solve the problem since this choice
implies that the prior "probability" of a finite interval (a, b) is
then always null, but he avoids the difficulty by admitting that the
probability that $\sigma$ falls in a particular range is zero,
because zero probability does not imply impossibility. He also
acknowledges that the invariance principle cannot encompass the
whole range of transforms without being inconsistent, but he
nonetheless sticks to the $\pi(\sigma) = 1/\sigma$ prior as it is
better than the Bayes-Laplace rule.^ Once again,
the argument sustaining the whole of Section 3.1 is



In both the 19th and early 20th centuries, there is a tra- 
dition within the not-yet-Bayesian literature to go to extreme 
lengths in the justification of a particular prior distribution, as 
if there existed one golden prior. See, for example, Broemeling 
and Broemeling (2003) in this respect. 



incomplete since missing the fundamental issue of 
distinguishing proper from improper priors. 

While Haldane's (1932) prior on probabilities (or rather on chances
as defined in Section 1.7),

$$\pi(p) \propto \frac{1}{p(1-p)},$$

is dismissed as too extreme (and inconsistent), there is no
discussion of the main difficulty with this prior (or with any other
improper prior associated with a finite-support sampling
distribution), which is that the corresponding posterior
distribution is not defined when $x \sim \mathcal{B}(n,p)$ is either
equal to 0 or to n (although Jeffreys concludes that x = 0 leads to
a point mass at p = 0, due to the infinite mass normalization).^
Instead, the corresponding Jeffreys's prior

$$\pi(p) \propto \frac{1}{\sqrt{p(1-p)}}$$

is suggested with little justification against the (truly) uniform
prior: we may as well use the uniform distribution.
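The difficulty with Haldane's prior can be spelled out through the Beta form of the posterior. The sketch below rests on the standard conjugate update for a binomial observation; the function names and the sample size n = 10 are hypothetical.

```python
def beta_posterior(x, n, a, b):
    """Exponents of the Beta posterior for x successes in n trials under a prior
    proportional to p^(a-1) (1-p)^(b-1); a = b = 0 is Haldane's improper prior."""
    return a + x, b + n - x

def is_proper(params):
    """A Beta(a, b) kernel is integrable on (0, 1) only if a > 0 and b > 0."""
    a_post, b_post = params
    return a_post > 0 and b_post > 0

n = 10  # hypothetical sample size
haldane = [is_proper(beta_posterior(x, n, 0, 0)) for x in (0, 3, n)]
jeffreys = [is_proper(beta_posterior(x, n, 0.5, 0.5)) for x in (0, 3, n)]
uniform = [is_proper(beta_posterior(x, n, 1, 1)) for x in (0, 3, n)]
print(haldane, jeffreys, uniform)
```

Only Haldane's rule fails for the homogeneous samples x = 0 and x = n; the arc-sine and uniform priors always return a proper posterior.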

^Jeffreys (1931, 1937) does address the problem in a clearer manner,
stating that this is not serious, for so long as the sample is
homogeneous (meaning x = 0, n) the extreme values (meaning p = 0, 1)
are still admissible, and we do attach a high probability to the
proposition is of one type; while as soon as any exceptions are
known the extreme values are completely excluded and no infinity
arises (Section 10.1, page 195).

4.2 Laplace's Succession Rule

Section 3.2 contains a Bayesian processing of Laplace's succession
rule, which is an easy introduction given that the parameter of the
sampling distribution, a hypergeometric $\mathcal{H}(N,r)$, is an
integer. The choice of a uniform prior on r, $\pi(r) = 1/(N+1)$,
does not require much of a discussion and the posterior distribution

$$\pi(r \mid l, m, N, H) = \binom{r}{l}\binom{N-r}{m} \Big/ \binom{N+1}{l+m+1}$$

is available in closed form, including the normalizing constant. The
posterior predictive probability that the next specimen will be of
the same type is then $(l+1)/(l+m+2)$ and more complex predictive
probabilities can be computed as well. As in earlier books involving
Laplace's succession rule, the section argues about its truthfulness
from a metaphysical point of view (using classical arguments about
the probabilities of the sun rising tomorrow and of all swans being
white that always seem to be associated with this topic) but, more
interestingly, it then moves to introducing a point mass on specific
values of the parameter in preparation for hypothesis testing.
Namely, following a renewed criticism of the uniform assessment via
the fact that

$$P(r = N \mid l, m = 0, N, H) = \frac{l+1}{N+1}$$

is too small, Jeffreys suggests setting aside a portion 2k of the
prior mass for both extreme values r = 0 and r = N. This is indeed
equivalent to using a point mass on the null hypothesis of
homogeneity of the population. While mixed samples are independent
of the choice of k (since they exclude those extreme values), a
sample of the first type with l = n leads to a posterior probability
ratio of

$$\frac{P(r = N \mid l = n, N, H)}{P(r \neq N \mid l = n, N, H)} = \frac{n+1}{N-n}\,\frac{k}{1-2k}\,\frac{N-1}{1},$$

which leads to the crucial question of the choice^^ of 
k. The ensuing discussion is not entirely convincing: 
I is too large, | is not unreasonable [but] too low in 
this case. The alternative 

$$k = \frac{1}{4} + \frac{1}{2(N+1)}$$

argues that the classification of possibilities [is] as 
follows: (1) Population homogeneous on account of 
some general rule. (2) No general rule but extreme 
values to be treated on a level with others. This pro- 
posal is mostly interesting for its bearing on the 
continuous case, for, in the finite case, it does not 
sound logical to put weight on the null hypothesis 
(r = 0 and r = N) within the alternative, since this
confuses the issue. (See Berger, Bernardo and Sun, 
2009, for a recent reappraisal of this approach from 
the point of view of reference priors.) 
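The closed-form quantities of this section can be verified by exact enumeration. The sketch below is ours and uses hypothetical values (N = 20, l = 5, m = 2, k = 1/4); it assumes the hypergeometric likelihood is proportional to C(r, l) C(N - r, m) and that the mass 1 - 2k is spread evenly over r = 1, ..., N - 1.

```python
from fractions import Fraction
from math import comb

def uniform_prior(N):
    """Equal mass 1/(N+1) on r = 0, ..., N."""
    return [Fraction(1, N + 1)] * (N + 1)

def point_mass_prior(N, k):
    """Mass k on each of r = 0 and r = N, the remainder spread evenly on 1..N-1."""
    inner = (1 - 2 * k) / (N - 1)
    return [k] + [inner] * (N - 1) + [k]

def posterior(N, l, m, prior):
    """Posterior on r after drawing l members of the first type and m of the second."""
    weights = [prior[r] * comb(r, l) * comb(N - r, m) for r in range(N + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def prob_next_first_type(N, l, m, prior):
    """Predictive probability that the next draw is of the first type."""
    post = posterior(N, l, m, prior)
    return sum(p * Fraction(r - l, N - l - m) for r, p in enumerate(post))

N, l, m, k = 20, 5, 2, Fraction(1, 4)        # hypothetical values
succession = prob_next_first_type(N, l, m, uniform_prior(N))
homogeneous = posterior(N, l, 0, uniform_prior(N))[N]
with_mass = posterior(N, l, 0, point_mass_prior(N, k))[N]
odds = with_mass / (1 - with_mass)
print(succession, homogeneous, odds)
```

With these values the succession rule gives (l+1)/(l+m+2) = 2/3, the uniform prior gives probability (l+1)/(N+1) = 2/7 to a homogeneous population after five like draws, and the point-mass prior raises the corresponding odds to 19/5.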

Section 3.3 seems to extend Laplace's succession rule to the case in
which the class sampled consists of several types, but it actually
deals with the (much more interesting) case of Bayesian inference
for the multinomial $\mathcal{M}(n; p_1, \ldots, p_r)$ distribution,
when using the Dirichlet $\mathcal{D}(1, \ldots, 1)$ distribution as
a prior. Jeffreys recovers the Dirichlet
$\mathcal{D}(x_1 + 1, \ldots, x_r + 1)$ distribution as the
posterior distribution and he derives the predictive probability
that the next member will be of the first type as

$$(x_1 + 1) \Big/ \Big(\sum_i x_i + r\Big).$$

There could be some connections there with the ir- 
relevance of alternative hypotheses later (in time) 
discussed in polytomous regression models (Gourieroux 
and Monfort, 1996), but they are well hidden. In any 
case, the Dirichlet distribution is not invariant to the 
introduction of new types. 
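The lack of invariance to the introduction of new types shows up directly in the predictive formula. A minimal sketch, with hypothetical counts:

```python
from fractions import Fraction

def predictive_first_type(counts):
    """Probability that the next draw is of the first type under a
    Dirichlet(1, ..., 1) prior on the multinomial probabilities."""
    r = len(counts)
    return Fraction(counts[0] + 1, sum(counts) + r)

three_types = predictive_first_type([3, 1, 0])      # hypothetical counts
four_types = predictive_first_type([3, 1, 0, 0])    # same data, one more empty type
print(three_types, four_types)  # 4/7 then 1/2
```

Adding a type that was never observed changes the prediction for the first type, although the data are unchanged.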

4.3 Poisson Distribution 

The processing of the estimation of the parameter $\alpha$ of the
Poisson distribution $\mathcal{P}(\alpha)$ is based on the
[improper] prior $\pi(\alpha) \propto 1/\alpha$, deemed to be the
correct prior probability distribution for scale invariance reasons.
Given n observations from $\mathcal{P}(\alpha)$ with sum $S_n$,
Jeffreys reproduces Haldane's (1932) derivation of the Gamma
posterior $\mathcal{G}a(S_n, n)$ and he notes that $S_n$ is a
sufficient statistic, but does not make a general property of it at
this stage. (This is done in Section 3.7.)

The alternative choice $\pi(\alpha) \propto 1/\sqrt{\alpha}$ will be
later justified in Section 3.10 not as Jeffreys's (invariant) prior
but as leading to a posterior defined for all observations, which is
not the case of $\pi(\alpha) \propto 1/\alpha$ when x = 0, a fact
overlooked by Jeffreys. Note that $\pi(\alpha) \propto 1/\alpha$ can
nonetheless be advocated by Jeffreys on the ground that the Poisson
process derives from the exponential distribution, for which
$\alpha$ is a scale parameter: $e^{-\alpha t}$ represents the
fraction of the atoms originally present that survive after time t.
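The contrast between the two Poisson priors reduces to the shape parameter of the Gamma posterior. The sketch below assumes the standard update, prior times likelihood proportional to α^(S_n - c) e^(-nα) for a prior α^(-c); the function names and counts are hypothetical.

```python
def poisson_posterior(total, n, prior_exponent):
    """Gamma (shape, rate) posterior for n Poisson counts summing to total,
    under a prior proportional to alpha ** (-prior_exponent)."""
    return total + 1 - prior_exponent, n

def is_proper(shape, rate):
    """A Gamma kernel is integrable only for positive shape and rate."""
    return shape > 0 and rate > 0

# Hypothetical data: three observations, either summing to 7 or all equal to 0.
print(poisson_posterior(7, 3, 1))                    # (7, 3), that is Ga(S_n, n)
print(is_proper(*poisson_posterior(0, 3, 1)))        # False under 1/alpha
print(is_proper(*poisson_posterior(0, 3, 0.5)))      # True under 1/sqrt(alpha)
```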

4.4 Normal Distribution 

When the sampling variance $\sigma^2$ of a normal model
$\mathcal{N}(\mu,\sigma^2)$ is known, the posterior distribution
associated with a flat prior is correctly derived as
$\mu \mid x_1, \ldots, x_n \sim \mathcal{N}(\bar{x}, \sigma^2/n)$
(with the repeated difficulty about the use of a $\sigma$-finite
measure as a probability). Under the joint improper prior

$$\pi(\mu, \sigma) \propto 1/\sigma,$$

the (marginal) posterior on $\mu$ is obtained as a Student's t

$$\mathcal{T}(n-1, \bar{x}, s^2/n(n-1))$$

distribution, while the marginal posterior on $\sigma^2$ is an
inverse gamma $\mathcal{IG}((n-1)/2, s^2/2)$.^



^A prior weight of 2k = 1/2 is reasonable since it gives equal
probability to both hypotheses.

^Section 3.41 also contains the interesting remark that,
conditional on two observations, $x_1$ and $x_2$, the posterior

Jeffreys notices that, when n = 1, the above prior does not lead to
a proper posterior since $\pi(\mu \mid x_1) \propto 1/|\mu - x_1|$
is not integrable, but he concludes that the solution degenerates in
the right way, which, we suppose, is meant to say that there is not
enough information in the data. But, without further formalization,
it is a delicate conclusion to make.

Under the same noninformative prior, the predictive density of a
second sample with sufficient statistic $(\bar{x}_2, s_2)$ is found^
to be proportional to

$$s_2^{n_2-2}\left\{n_1 s_1^2 + n_2 s_2^2 + \frac{n_1 n_2}{n_1+n_2}(\bar{x}_2 - \bar{x}_1)^2\right\}^{-(n_1+n_2-1)/2}.$$

A direct conclusion is that this implies that $\bar{x}_2$ and $s_2$
are dependent for the predictive, if independent given $\mu$ and
$\sigma$, while the marginal predictives on $\bar{x}_2$ and $s_2$
are Student's t and Fisher's z, respectively. Extensions to the
prediction of multiple future samples with the same (Section 3.43)
or with different (Section 3.44) means follow without surprise. In
the latter case, given m samples of $n_r$ ($1 \le r \le m$) normal
$\mathcal{N}(\mu_r, \sigma^2)$ measurements, the posterior on
$\sigma^2$ under the noninformative prior

$$\pi(\mu_1, \ldots, \mu_m, \sigma) \propto 1/\sigma$$

is again an inverse gamma $\mathcal{IG}(\nu/2, s^2/2)$
distribution,^ with $s^2 = \sum_r \sum_i (x_{ri} - \bar{x}_r)^2$ and
$\nu = \sum_r (n_r - 1)$, while the posterior on
$t = \sqrt{n_i}(\mu_i - \bar{x}_i)/s$ is a Student's t with $\nu$
degrees of freedom for all i's (no matter what the number of
observations within this group is). Figure 1 represents the
posteriors on the means $\mu_i$ for the data set analyzed in this
section on
seven sets of measurements of the gravity. A para- 



probability that $\mu$ is between both observations is exactly 1/2.
Jeffreys attributes this property to the fact that the scale
$\sigma$ is directly estimated from those two observations under a
noninformative prior. Section 3.8 generalizes the observation to all
location-scale families with median equal to the location.
Otherwise, the posterior probability is less than 1/2. Similarly,
the probability that a third observation will be between $x_1$ and
$x_2$ is equal to 1/3 under the predictive. While Jeffreys gives a
proof by complete integration, this is a direct consequence of the
exchangeability of $x_1$, $x_2$ and $x_3$. Note also that this is
one of the rare occurrences of a credible interval in the book. 

^^In the current 1961 edition, n2S2 is mistakenly typed as 
n2S2 in equation (6) of Section 3.42. 

Jeffreys does not use the term "inverse gamma distribu- 
tion" but simply notes that this is a distribution with a scale 
parameter that is given by a single set of tables (for a given
$\nu$). He also notices that the distribution of the transform
$\log(\sigma/s)$ is closer to a normal distribution than the
original.



Fig. 1. Seven posterior distributions on the values of acceleration
due to gravity (in cm/sec$^2$) at locations in East Africa when
using a noninformative prior.



graph in Section 3.44 contains hints about hierar- 
chical Bayes modeling as a way of strengthening es- 
timation, which is a perspective later advanced in 
favor of this approach (Lindley and Smith, 1972; 
Berger and Robert, 1990). 

The extension in Section 3.5 to the setting of the normal linear
regression model should be simple (see, e.g., Marin and Robert,
2007, Chapter 3), except that the use of tensorial conventions (like
when a suffix i is repeated it is to be given all values from 1 to
m) and the absence of matrix notation make the reading quite arduous
for today's readers. Because of this lack of matrix tools, Jeffreys
uses an implicit diagonalization of the regressor matrix
$X^{\mathrm{T}}X$ (with modern notation) and thus expresses the
posterior in terms of the transforms of the regression
coefficients.^ This section is worth reading if only to realize the
immense advantage of using matrix notation. The case of regression
equations

$$y_i = X_i\beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2),$$

with different unknown variances leads to a poly-t output (Bauwens,
1984) under a noninformative prior, which is deemed to be a
complication, and Jeffreys prefers to revert to the case when
$\sigma_i^2 = \omega_i\sigma^2$ with known $\omega_i$'s.^ The final
part of this section mentions the interesting subcase of estimating
a normal mean $\alpha$ when truncated at $\alpha = 0$: negative
observations do not need to be rejected since only the posterior
distribution has to be truncated at 0. [In a similar spirit,
Section 3.6 shows how to process a uniform
$\mathcal{U}(\alpha - \sigma, \alpha + \sigma)$ distribution under
the noninformative $\pi(\alpha, \sigma) = 1/\sigma$ prior.]

^Using the notation $c_i$ for $y_i$, $x_i$ for $\beta_i$, $y_i$ for
$\xi_i$ and $a_{ir}$ for $x_{ir}$ certainly makes reading this part
more arduous.

^Sections 3.53 and 3.54 detail the numerical resolution of the
normal equations by iterative methods and have no real bearing on
modern Bayesian analysis.

Section 3.9 examines the estimation of a two-dimensional covariance
matrix

$$\Theta = \begin{pmatrix} \sigma^2 & \varrho\sigma\tau \\ \varrho\sigma\tau & \tau^2 \end{pmatrix}$$

under centred normal observations. The prior advocated by Jeffreys
is $\pi(\tau, \sigma, \varrho) \propto 1/\tau\sigma$, leading to the
(marginal) posterior

$$\pi(\varrho \mid \hat\varrho, n) \propto \int_0^\infty \frac{(1-\varrho^2)^{n/2}}{(\cosh\beta - \varrho\hat\varrho)^n}\,d\beta = \frac{(1-\varrho^2)^{n/2}}{(1-\varrho\hat\varrho)^{n-1/2}} \int_0^1 \frac{(1-u)^{n-1}}{\sqrt{2u}}\,\{1 - (1+\varrho\hat\varrho)u/2\}^{-1/2}\,du$$

that only depends on $\hat\varrho$. (Jeffreys notes that, when
$\sigma$ and $\tau$ are known, the posterior of $\varrho$ also
depends on the empirical variances for both components. This
paradoxical increase in the dimension of the sufficient statistics
when the number of parameters is decreasing is another illustration
of the limited meaning of marginal sufficient statistics pointed out
by Basu, 1988.) While this integral can be computed via
hypergeometric functions (Gradshteyn and Ryzhik, 1980),

$$\int_0^1 \frac{(1-u)^{n-1}}{\sqrt{u\{1 - (1+\varrho\hat\varrho)u/2\}}}\,du = B(1/2, n)\,{}_2F_1\{1/2, 1/2; n+1/2; (1+\varrho\hat\varrho)/2\},$$

the corresponding posterior is certainly less manageable than the
inverse Wishart that would result from a power prior
$|\Theta|^{\gamma}$ on the matrix itself. The extension to
noncentred observations with flat priors on the means induces a
small change in the outcome in that

$$\pi(\varrho \mid \hat\varrho, n) \propto \frac{(1-\varrho^2)^{(n-1)/2}}{(1-\varrho\hat\varrho)^{n-3/2}} \int_0^1 \frac{(1-u)^{n-2}}{\sqrt{2u}}\,\{1 - (1+\varrho\hat\varrho)u/2\}^{-1/2}\,du,$$

which is also the posterior obtained directly from the distribution
of $\hat\varrho$. Indeed, the sampling distribution is given by

$$f(\hat\varrho \mid \varrho) = \frac{n-2}{\sqrt{2\pi}}\,(1-\varrho^2)^{(n-1)/2}(1-\hat\varrho^2)^{(n-4)/2}\,\frac{\Gamma(n-1)}{\Gamma(n-1/2)}\,(1-\varrho\hat\varrho)^{-(n-3/2)}\,{}_2F_1\{1/2, 1/2; n-1/2; (1+\varrho\hat\varrho)/2\}.$$

There is thus no marginalization paradox (Dawid, Stone and Zidek,
1973) for this prior selection, while one occurs for the alternative
choice $\pi(\tau, \sigma, \varrho) \propto 1/\tau^2\sigma^2$.

4.5 Sufficiency and Exponential Families

Section 3.7 generalizes^ observations made previously about
sufficient statistics for particular distributions (Poisson,
multinomial, normal, uniform). If there exists a sufficient
statistic T(x) when $x \sim f(x \mid \alpha)$, the posterior
distribution on $\alpha$ only depends on T(x) and on the number n of
observations.^ The generic form of densities from exponential
families

$$\log f(x \mid \alpha) = (x - \alpha)\mu'(\alpha) + \mu(\alpha) + \psi(x)$$

is obtained by a convoluted argument of imposing $\bar{x}$ as the
MLE of $\alpha$, which is not equivalent to requiring $\bar{x}$ to
be sufficient. The more general formula

$$f(x \mid \alpha_1, \ldots, \alpha_m) = \phi(\alpha_1, \ldots, \alpha_m)\,\psi(x)\exp\sum_{s=1}^m u_s(\alpha)\,v_s(x)$$

is provided as a consequence of the (then very recent)
Pitman-Koopman[-Darmois] theorem^ on the necessary and sufficient
connection between the existence of fixed dimensional sufficient
statistics and exponential families. The theorem as stated does not
impose a fixed support on the densities $f(x \mid \alpha)$ and this
invalidates the necessary part, as shown in Section 3.6 with the
uniform distribution. It is only later in Section 3.6 that
parameter-dependent supports are mentioned, with an unclear
conclusion. Surprisingly, this section does not contain any
indication that the specific structure of exponential families could
be used to construct conjugate^ priors (Raiffa, 1968). This lack of
connection with regular priors highlights the fully noninformative
perspective advocated in Theory of Probability, despite comments
(within the book) that priors should reflect prior beliefs and/or
information.

^Jeffreys's derivation remains restricted to the unidimensional
case.

^Stating that n is an ancillary statistic is both formally correct
in Fisher's sense (n does not depend on $\alpha$) and ambiguous from
a Bayesian perspective since the posterior on $\alpha$ depends on n.

^Darmois (1935) published a version (in French) of this theorem in
1935, about a year before both Pitman (1936) and Koopman (1936).

4.6 Predictive Densities 

Section 3.8 contains the rather amusing and not well-known result
that, for any location-scale parametric family such that the
location parameter is the median, the posterior probability that the
location parameter lies between the first two observations is 1/2,
while the predictive probability that the third observation lies
between the first two observations is 1/3. This may be the first use
of Bayesian predictive distributions, that is,
$p(x_3 \mid x_1, x_2)$ in this case, where parameters are integrated
out. Such predictive distributions cannot be properly defined in
frequentist terms; at best, one may take
$p(x_3 \mid \theta = \hat\theta)$ where $\hat\theta$ is a plug-in
estimator. Building more sensible predictives seems to be one major
appeal of the Bayesian approach for modern practitioners, in
particular, econometricians.
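In the normal case, the two probabilities attached to a pair of observations (1/2 for the location parameter and 1/3 for a third observation, as recalled in the footnote on Section 3.41) follow from Cauchy tail areas. The sketch below is ours: it takes the posterior on the mean to be a Student's t with one degree of freedom and squared scale s^2/n(n-1) for n = 2, and it assumes that the predictive of a third observation is the same t with squared scale inflated to s^2 (1 + 1/n)/(n - 1).

```python
from math import atan, pi, sqrt

def central_cauchy_probability(half_width, scale):
    """P(|T| < half_width) for a centred Cauchy variable T (Student's t, 1 d.f.)."""
    return 2 * atan(half_width / scale) / pi

def two_observation_probabilities(x1, x2):
    d = abs(x2 - x1)
    ss = d ** 2 / 2                      # sum of squares about the mean, n = 2
    mu_scale = sqrt(ss / 2)              # from T(n-1, xbar, s^2 / n(n-1)) with n = 2
    x3_scale = sqrt(ss * (1 + 1 / 2))    # assumed predictive scale for a new draw
    p_mu = central_cauchy_probability(d / 2, mu_scale)
    p_x3 = central_cauchy_probability(d / 2, x3_scale)
    return p_mu, p_x3

p_mu, p_x3 = two_observation_probabilities(1.3, 4.0)   # hypothetical observations
print(p_mu, p_x3)
```

Both values are free of the observations themselves, since only the ratio of the half-width to the scale enters the computation.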

4.7 Jeffreys's Priors 

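Section 3.10 introduces Fisher information as a quadratic
approximation to distributional distances. Given the Hellinger
distance and the Kullback-Leibler divergence,

$$d_1(P, P') = \int \left|(dP)^{1/2} - (dP')^{1/2}\right|^2$$

and

$$d_2(P, P') = \int \log\frac{dP}{dP'}\,d(P - P'),$$

we have the second-order approximations

$$d_1(P_\alpha, P_{\alpha'}) \approx \tfrac{1}{4}(\alpha - \alpha')^{\mathrm{T}} I(\alpha)(\alpha - \alpha')$$

and

$$d_2(P_\alpha, P_{\alpha'}) \approx (\alpha - \alpha')^{\mathrm{T}} I(\alpha)(\alpha - \alpha'),$$

where

$$I(\alpha) = \mathbb{E}\left[\frac{\partial \log f(x \mid \alpha)}{\partial\alpha}\,\frac{\partial \log f(x \mid \alpha)}{\partial\alpha^{\mathrm{T}}}\right]$$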
da da 



As pointed out to us by Dennis Lindley, Section 1.7 comes
close to the concept of exchangeability when introducing 
chances. 



is Fisher information.^ A first comment of importance is that
$I(\alpha)$ is equivariant under reparameterization, because both
distances are functional distances and thus invariant for all
nonsingular transformations of the parameters. Therefore, if
$\alpha'$ is a (differentiable) transform of $\alpha$,

$$I(\alpha') = \frac{d\alpha}{d\alpha'}\,I(\alpha)\,\frac{d\alpha^{\mathrm{T}}}{d\alpha'},$$

and this is the spot where Jeffreys states his general principle for
deriving noninformative priors (Jeffreys's priors):

$$\pi(\alpha) \propto |I(\alpha)|^{1/2}$$

is thus an ideal prior in that it is invariant under any
(differentiable) transformation.
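The invariance claim can be checked on a binomial model, comparing the prior derived directly in the logit parameterization with the one obtained by transporting the prior on p. This is a numerical sketch of ours, with hypothetical values (n = 10, p = 0.3) and a finite-difference score; it is not the infinitesimal argument used in the book.

```python
from math import comb, exp, log, sqrt

def binom_logpmf(x, n, p):
    return log(comb(n, x)) + x * log(p) + (n - x) * log(1 - p)

def fisher_info(n, param, to_p, h=1e-5):
    """E[(d/d param log f)^2] for B(n, p) with p = to_p(param), the score being
    approximated by a central finite difference."""
    p = to_p(param)
    total = 0.0
    for x in range(n + 1):
        score = (binom_logpmf(x, n, to_p(param + h))
                 - binom_logpmf(x, n, to_p(param - h))) / (2 * h)
        total += exp(binom_logpmf(x, n, p)) * score ** 2
    return total

def expit(t):
    return 1 / (1 + exp(-t))

n, p = 10, 0.3                               # hypothetical values
theta = log(p / (1 - p))                     # logit reparameterization
info_p = fisher_info(n, p, lambda q: q)
info_theta = fisher_info(n, theta, expit)
direct = sqrt(info_theta)                    # Jeffreys's prior computed in theta
transported = sqrt(info_p) * p * (1 - p)     # prior on p times |dp / d theta|
print(info_p, info_theta, direct, transported)
```

Both routes give the same density at the chosen point, up to the finite-difference error.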

Quite curiously, there is no motivation for this 
choice of priors other than invariance (at least at this 
stage) and consistency (at the end of the chapter). 
Fisher information is only perceived as a second or- 
der approximation to two functional distances, with 
no connection with either the curvature of the like- 
lihood or the variance of the score function, and no 
mention of the information content at the current 
value of the parameter or of the local discriminating 
power of the data. Finally, no connection is made 
at this stage with Laplace's approximation (see Sec- 
tion 4.0). The motivation for centering the choice 
of the prior at I{a) is thus uncertain. No mention 
is made either of the potential use of those func- 
tional distances as intrinsic loss functions for the 
[point] estimation of the parameters (Le Cam, 1986; 
Robert, 1996). However, the use of these intrinsic divergences
(measures of discrepancy) to introduce $I(\alpha)$ as a key quantity
seems to indicate that Jeffreys understood $I(\alpha)$ as the local
discriminating power of the model and to some extent as the
intrinsic factor used to compensate for the lack of invariance of
$|\alpha - \alpha'|^2$. It corroborates the fact that Jeffreys's
priors are known to behave particularly well in one-dimensional
cases.
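The role of Fisher information as a second-order approximation to both divergences can be illustrated on the Poisson model, for which the information on the mean a is 1/a. The sketch below is ours, with hypothetical values of the parameter and of the perturbation; the sums are truncated at kmax terms, which is an additional assumption.

```python
from math import exp, lgamma, log, sqrt

def poisson_pmf(k, a):
    return exp(k * log(a) - a - lgamma(k + 1))

def d1(a, b, kmax=200):
    """Hellinger-type distance: sum over k of (sqrt(p_k) - sqrt(q_k))^2."""
    return sum((sqrt(poisson_pmf(k, a)) - sqrt(poisson_pmf(k, b))) ** 2
               for k in range(kmax))

def d2(a, b, kmax=200):
    """Symmetrised Kullback-Leibler divergence: sum of (p_k - q_k) log(p_k / q_k),
    with log(p_k / q_k) = k log(a / b) - (a - b) for two Poisson laws."""
    return sum((poisson_pmf(k, a) - poisson_pmf(k, b))
               * (k * log(a / b) - (a - b)) for k in range(kmax))

a, delta = 3.0, 0.01              # hypothetical parameter value and perturbation
fisher = 1 / a                    # Fisher information of the Poisson mean
ratio_1 = d1(a, a + delta) / (0.25 * delta ** 2 * fisher)
ratio_2 = d2(a, a + delta) / (delta ** 2 * fisher)
print(ratio_1, ratio_2)           # both close to 1 for a small perturbation
```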

Immediately, a problem associated with this generic principle is
spotted by Jeffreys for the normal distribution
$\mathcal{N}(\mu, \sigma^2)$. While, when considering $\mu$ and
$\sigma$ separately, one recovers the invariance priors
$\pi(\mu) \propto 1$ and $\pi(\sigma) \propto 1/\sigma$, Jeffreys's
prior on the pair $(\mu, \sigma)$ is
$\pi(\mu, \sigma) \propto 1/\sigma^2$. If, instead, m normal
observations with the same variance $\sigma^2$ were proposed, they
would lead to
$\pi(\mu_1, \ldots, \mu_m, \sigma) \propto 1/\sigma^{m+1}$, which is
unacceptable (because it induces a growing departure from the true
value as m increases). Indeed, if one considers the likelihood

$$L(\mu_1, \ldots, \mu_m, \sigma) \propto \sigma^{-mn}\exp\left[-\frac{n}{2\sigma^2}\sum_{i=1}^m\{(\bar{x}_i - \mu_i)^2 + s_i^2\}\right],$$

the marginal posterior on $\sigma$ is

$$\sigma^{-mn-1}\exp\left\{-\frac{n}{2\sigma^2}\sum_{i=1}^m s_i^2\right\},$$

that is,

$$\sigma^{-2} \sim \mathcal{G}a\left\{(mn+1)/2,\; n\sum_i s_i^2/2\right\} \quad\text{and}\quad \mathbb{E}[\sigma^2] = \frac{n\sum_{i=1}^m s_i^2}{mn - 1},$$

whose own expectation is

$$\frac{mn - m}{mn - 1}\,\sigma_0^2,$$

if $\sigma_0$ denotes the "true" standard deviation. If n is small
against m, the bias resulting from this choice will be important.
Therefore, in this special case, Jeffreys proposes a departure from
the general rule by using $\pi(\mu, \sigma) \propto 1/\sigma$.
(There is a further mention of difficulties with a large number of
parameters when using one single scale parameter, with the same
solution proposed. There may even be an indication about reference
priors at this stage, when stating that some transforms do not need
to be considered.)

^Jeffreys uses an infinitesimal approximation to derive $I(\alpha)$
in Theory of Probability, which is thus not defined this way, nor
connected with Fisher.

^Obviously, those priors are not called Jeffreys's priors in the
book but, as a counter-example to Steve Stigler's law of eponymy
(Stigler, 1999), the name is now correctly associated with the
author of this new concept.
The arc-sine law on probabilities,

$$\pi(p) = \frac{1}{\pi}\,\frac{1}{\sqrt{p(1-p)}},$$



^As pointed out to us by Lindley (2008, private communication),
Jeffreys expresses more clearly the difficulty that the
corresponding t distribution would always be [of index] (n + 1)/2,
no matter how many true values were estimated, that is, that the
natural reduction of the degrees of freedom with the number of
nuisance parameters does not occur with this prior.



is found to be the corresponding reference distribution, with a more
severe criticism of the other distributions (see Section 4.1): both
the usual rule and Haldane's rule are rather unsatisfactory. The
corresponding Dirichlet $\mathcal{D}(1/2, \ldots, 1/2)$ prior is
obtained on the probabilities of a multinomial distribution.
Interestingly too, Jeffreys derives most of his priors by
recomputing the $L_2$ or Kullback distance and by using a
second-order approximation, rather than by following the genuine
definition of the Fisher information matrix. Because Jeffreys's
prior on the Poisson $\mathcal{P}(\lambda)$ parameter is
$\pi(\lambda) \propto 1/\sqrt{\lambda}$, there is some attempt at
justification, with the mention that general rules for the prior
probability give a starting point, that is, act like reference
priors (Berger and Bernardo, 1992).

In the case of the (normal) correlation coefficient, the posterior
corresponding to Jeffreys's prior
$\pi(\varrho, \tau, \sigma) \propto 1/\tau\sigma(1 - \varrho^2)^{3/2}$
is not properly defined for a single observation, but Jeffreys does
not expand on the generic improper nature of those prior
distributions. In an attempt close to defining a reference prior, he
notices that, with both $\tau$ and $\sigma$ fixed, the (conditional)
prior is

$$\pi(\varrho) \propto \frac{\sqrt{1 + \varrho^2}}{1 - \varrho^2},$$

which, while improper, can also be compared to the arc-sine prior

$$\pi(\varrho) = \frac{1}{\pi}\,\frac{1}{\sqrt{1 - \varrho^2}},$$

which is integrable as is. Note that Jeffreys does not conclude in
favor of one of those priors: We cannot really say that any of these
rules is better than the uniform distribution.

In the case of exponential families with natural parameter $\beta$,

$$f(x \mid \beta) = \psi(x)\,\phi(\beta)\exp\{\beta v(x)\},$$

Jeffreys does not take advantage of the fact that Fisher information
is available as a transform of $\phi$, indeed,

$$I(\beta) = -\partial^2 \log\phi(\beta)/\partial\beta^2,$$

but rather insists on the invariance of the distribution under
location-scale transforms, $\beta = k\beta' + l$, which does not
correctly account for potential boundaries on $\beta$.
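As an illustration of Fisher information read off the normalizing factor, consider the Poisson family written with natural parameter β = log λ, for which φ(β) = exp(-exp(β)). The sketch below is ours: the function names are hypothetical and the second derivative is taken by finite differences. It also transports the resulting prior back to λ, where it should agree with the 1/√λ rule quoted earlier.

```python
from math import exp, log, sqrt

def log_phi(beta):
    """log of the normalising factor phi(beta) = exp(-exp(beta)) of the Poisson
    family written with natural parameter beta = log(lambda)."""
    return -exp(beta)

def fisher_info_natural(beta, h=1e-4):
    """Minus the second derivative of log phi, by central finite differences."""
    return -(log_phi(beta + h) - 2 * log_phi(beta) + log_phi(beta - h)) / h ** 2

def jeffreys_density_in_lambda(lam):
    """Unnormalised Jeffreys prior transported from beta to lambda = exp(beta)."""
    beta = log(lam)
    return sqrt(fisher_info_natural(beta)) / lam   # |d beta / d lambda| = 1 / lambda

for lam in (0.5, 4.0):                              # hypothetical values
    print(lam, jeffreys_density_in_lambda(lam), 1 / sqrt(lam))
```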

Somewhat surprisingly, rather than resorting to the natural
"Jeffreys's prior,"
$\pi(\beta) \propto |\partial^2\log\phi(\beta)/\partial\beta^2|^{1/2}$,
Jeffreys prefers to use the "standard" flat, log-flat and symmetric
priors depending on the range of $\beta$. He then goes on to study
the alternative of defining the noninformative prior via the mean
parameterization suggested by Huzurbazar (see Huzurbazar, 1976),

$$\mu(\beta) = \int v(x) f(x \mid \beta)\,dx.$$

Given the overall invariance of Jeffreys's priors, this should not
make any difference, but Jeffreys chooses to pick priors depending
on the range of $\mu$. For instance, this leads him once again to
promote the Dirichlet $\mathcal{D}(1/2, 1/2)$ prior on the
probability p of a binomial model if considering that
$\log p/(1-p)$ is unbounded, and the uniform prior if considering
that $\mu(p) = np$ varies on $(0, \infty)$. It is interesting to see
that, rather than sticking to a generic principle inspired by the
Fisher information that Jeffreys himself recognizes as consistent
and that offers an almost universal range of applications, he
resorts to group invariant (Haar) measures when the rule, though
consistent, leads to results that appear to differ too much from
current practice.

We conclude with a delicate example that is found within Section 3.10. Our interpretation of a set of quantitative laws $\phi_r$ with chances $\alpha_r$ [such that] if $\phi_r$ is true, the chance of a variable $x$ being in a range $dx$ is $f_r(x,\alpha_{r1},\ldots,\alpha_{rn})\,dx$ is that of a mixture of distributions,

$$\sum_{r=1}^{m} \alpha_r f_r(x,\alpha_{r1},\ldots,\alpha_{rn}).$$

Because of the complex shape (convex combination) of the distribution, the Fisher information is not readily available and Jeffreys suggests assigning a reference prior to the weights $(\alpha_1,\ldots,\alpha_m)$, that is, a Dirichlet $\mathcal{D}(1/2,\ldots,1/2)$, along with separate reference priors on the $\alpha_{rs}$. Unfortunately, this leads to an improper posterior density (which integrates to infinity). In fact, mixture models do not allow for independent improper priors on their components (Marin, Mengersen and Robert, 2005).

^There is another typo when stating that $\log p/(1-p)$ ranges over $(0,\infty)$.

5. CHAPTER IV: APPROXIMATE METHODS 
AND SIMPLIFICATIONS 

The difference made by any ordinary change of the 
prior probability is comparable with the effect 
of one extra observation. 
H. Jeffreys, Theory of Probability, Section 4.0. 

As in Chapter II, many points of this chapter are 
outdated by modern Bayesian practice. The main 
bulk of the discussion is about various approxima- 
tions to (then) intractable quantities or posteriors, 
approximations that have limited appeal nowadays 
when compared with state-of-the-art computational 
tools. For instance, Sections 4.43 and 4.44 focus on 
the issue of grouping observations for a linear regres- 
sion problem: if data is gathered modulo a round- 
ing process [or if a polyprobit model is to be esti- 
mated (Marin and Robert, 2007)], data augmenta- 
tion (Tanner and Wong, 1987; Robert and Casella, 
2004) can recover the original values by simulation, 
rather than resorting to approximations. Mentions 
are made of point estimators, but there is unfortu- 
nately no connection with decision theory and loss 
functions in the classical sense (DeGroot, 1970; Berger, 
1985). A long section (Section 4.7) deals with rank 
statistics, containing apparently no connection with 
Bayesian Statistics, while the final section (Section 4.9) 
on randomized designs also does not cover the spe- 
cial issue of randomization within Bayesian Statis- 
tics (Berger and Wolpert, 1988). 

The major components of this chapter in terms 
of Bayesian theory are an introduction to Laplace's 
approximation, although not so-called (with an in- 
teresting side argument in favor of Jeffreys's priors), 
some comments on orthogonal parameterisation [un- 
derstood from an information point of view] and the 
well-known tramcar example. 

5.1 Laplace's Approximation 

When the number of observations $n$ is large, the posterior distribution can be approximated by a Gaussian centered at the maximum likelihood estimate with a range of order $n^{-1/2}$. There are numerous instances of the use of Laplace's approximation in Bayesian literature (see, e.g., Berger, 1985; MacKay, 2002), but only with specific purposes oriented toward model choice, not as a generic substitute. Jeffreys derives from this approximation an incentive to treat the prior probability as uniform since this is of no practical importance if the number of observations is large. His argument is made more precise through the normal approximation,

$$L(\theta|x_1,\ldots,x_n) \approx \tilde L(\theta|x) \propto \exp\{-n(\theta-\hat\theta)^{\mathsf T} I(\hat\theta)(\theta-\hat\theta)/2\},$$

to the likelihood. [Jeffreys notes that it is of trivial importance whether $I(\theta)$ is evaluated for the actual values or for the MLE $\hat\theta$.] Since the normalization factor is

$$(n/2\pi)^{m/2}\,|I(\hat\theta)|^{1/2},$$

using Jeffreys's prior $\pi(\theta) \propto |I(\theta)|^{1/2}$ means that the posterior distribution is properly normalized and that the posterior distribution of $\theta_i - \hat\theta_i$ is nearly the same (. . .) whether it is taken on data $\hat\theta_i$ or on $\theta_i$. This sounds more like a pivotal argument in Fisher's fiducial sense than genuine Bayesian reasoning, but it nonetheless brings an additional argument for using Jeffreys's prior, in the sense that the prior provides the proper normalizing factor. Actually, this argument is much stronger than it first looks in that it is at the very basis of the construction of matching priors (Welch and Peers, 1963). Indeed, when considering the proper normalizing constant ($\pi(\theta) \propto |I(\theta)|^{1/2}$), the agreement between the frequentist distribution of the maximum likelihood estimator and the posterior distribution of $\theta$ gets closer by an order of 1.

5.2 Outside Exponential Families 

When considering distributions that are not from exponential families, sufficient statistics of fixed dimension do not exist, and the MLE is much harder to compute. Jeffreys suggests in Section 4.1 using a minimum $\chi^2$ approximation to overcome this difficulty, an approach which is rarely used nowadays.

A particular example is the poly-$t$ (Bauwens, 1984) distribution

$$\pi(\mu|x_1,\ldots,x_s) \propto \prod_{r=1}^{s}\left\{1+\frac{(\mu-x_r)^2}{\nu_r s_r^2}\right\}^{-(\nu_r+1)/2}$$

that happens when several series of observations yield independent estimates [$x_r$] of the same true value [$\mu$]. The difficulty with this posterior can now be easily solved via a Gibbs sampler that demarginalizes each $t$ density.

Section 4.3 is not directly related to Bayesian Statistics in that it is considering (best) unbiased estimators, even though the Rao-Blackwell theorem is somehow alluded to. The closest connection with Bayesian Statistics could be that, once summary statistics have been chosen for their availability, a corresponding posterior can be constructed conditional on those statistics. The present equivalent of this proposal would then be to use variational methods (Jaakkola and Jordan, 2000) or ABC techniques (Beaumont, Zhang and Balding, 2002).

An interesting insight is given by the notion of orthogonal parameters in Section 4.31, to be understood as the choice of a parameterization such that $I(\theta)$ is diagonal. This orthogonalization is central in the construction of reference priors (Kass, 1989; Tibshirani, 1989; Berger and Bernardo, 1992; Berger, Philippe and Robert, 1998) that are identical to Jeffreys's priors. Jeffreys indicates, in particular, that full orthogonalization is impossible for $m = 4$ and more dimensions.

In Section 4.42 the errors-in-variables model is handled rather poorly, presumably because of computational difficulties: when considering ($1 \le r \le n$)

$$y_r = \alpha\xi + \beta + \varepsilon_r, \qquad x_r = \xi + \varepsilon'_r,$$

the posterior on $(\alpha,\beta)$ under standard normal errors is

$$\pi(\alpha,\beta|(x_1,y_1),\ldots,(x_n,y_n)) \propto \prod_{r=1}^{n}(t^2+\alpha^2 s^2)^{-1/2}\exp\left\{-\sum_{r=1}^{n}\frac{(y_r-\alpha x_r-\beta)^2}{2(t^2+\alpha^2 s^2)}\right\},$$

which induces a normal conditional distribution on $\beta$ and a more complex $t$-like marginal posterior distribution on $\alpha$ that can still be processed by present-day standards.

Section 4.45 also contains an interesting example of a normal $\mathcal{N}(\mu,\sigma^2)$ sample when there is a known contribution to the standard error, that is, when $\sigma^2 > \sigma'^2$ with $\sigma'$ known. In that case, using a flat prior on $\log(\sigma^2-\sigma'^2)$ leads to the posterior

$$\pi(\mu,\sigma|\bar x,s^2,n) \propto \frac{1}{\sigma^2-\sigma'^2}\,\frac{1}{\sigma^{n-1}}\exp\left[-\frac{n}{2\sigma^2}\{(\mu-\bar x)^2+s^2\}\right],$$

which integrates out over $\mu$ to

$$\pi(\sigma|s^2,n) \propto \frac{1}{\sigma^2-\sigma'^2}\,\frac{1}{\sigma^{n-2}}\exp\left[-\frac{ns^2}{2\sigma^2}\right].$$

The marginal obviously has an infinite mode (or pole) at $\sigma = \sigma'$, but there can be a second (and meaningful) mode if $s^2$ is large enough, as illustrated on Figure 2 (brown curve). The outcome is indeed different from using the truncated prior $\pi(\mu,\sigma) \propto 1/\sigma$ (blue curve), but to conclude that the inference using this assessment of the prior probability would be that $\sigma = \sigma'$ is based once again on the false premise that infinite mass posteriors act like Dirac priors, which is not correct: since $\pi(\sigma|s^2,n)$ is not integrable at $\sigma = \sigma'$, the posterior is simply not defined. In that sense, Jeffreys is thus right in rejecting this prior choice as absurd.

Fig. 2. Posterior distribution $\pi(\sigma|s^2,n)$ for $\sigma' = \sqrt{2}$, $n = 15$ and $ns^2 = 100$, when using the prior $\pi(\mu,\sigma) \propto 1/\sigma$ (blue curve) and the prior that is flat in $\log(\sigma^2-\sigma'^2)$ (brown curve).

^A side comment on the first-order symmetry between the probability of a set of statistics given the parameters and that of the parameters given the statistics seems to precede the first-order symmetry of the (posterior and frequentist) confidence intervals established in Welch and Peers (1963).

5.3 The Tramcar Problem 

This chapter contains (in Section 4.8) the now classical "tramway problem" of Newman, about "a man traveling in a foreign country [who] has to change trains at a junction, and goes into the town, the existence of which he has only just heard. He has no idea of its size. The first thing that he sees is a tramcar numbered 100. What can he infer about the number of tramcars in the town? It may be assumed that they are numbered consecutively from 1 upwards."

This is another illustration of the standard noninformative prior for a scale, that is, $\pi(n) \propto 1/n$, where $n$ is the number of tramcars; the posterior satisfies $\pi(n|m = 100) \propto 1/n^2\,\mathbb{I}(n \ge 100)$ and

$$P(n > n_0|m) = \sum_{r=n_0+1}^{\infty} r^{-2}\Big/\sum_{r=m}^{\infty} r^{-2} \approx \frac{m}{n_0}.$$

Therefore, the posterior median (the justification of which as a Bayes estimator is not included) is approximately $2m$. Although this point is not discussed by Jeffreys, this example is often mentioned in support of the Bayesian approach against the MLE, since the corresponding maximum estimator of $n$ is $m$, which can never exceed the true value of $n$, while the Bayes estimator takes a more reasonable value.
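The posterior tail probability and median quoted above can be checked numerically. The following is a minimal sketch, assuming the posterior proportional to $1/n^2$ on $n \ge m$ stated in the text; the function names and the midpoint tail correction are choices made here for illustration, not part of Jeffreys's treatment.

```python
import math


def tail_sum(a, extra_terms=100_000):
    """Approximate sum_{r=a}^{infinity} r**-2.

    The first `extra_terms` terms are summed exactly; the remainder is
    replaced by the midpoint approximation 1 / (upper - 1/2).
    """
    upper = a + extra_terms
    head = math.fsum(1.0 / (r * r) for r in range(a, upper))
    return head + 1.0 / (upper - 0.5)


def prob_exceeds(n0, m):
    """P(n > n0 | m) under the posterior proportional to n**-2 on n >= m."""
    if n0 < m:
        return 1.0
    return tail_sum(n0 + 1) / tail_sum(m)


def posterior_median(m):
    """Smallest n0 such that P(n <= n0 | m) >= 1/2."""
    total = tail_sum(m)
    cumulative = 0.0
    n0 = m
    while True:
        cumulative += 1.0 / (n0 * n0) / total
        if cumulative >= 0.5:
            return n0
        n0 += 1


if __name__ == "__main__":
    m = 100
    print("posterior median:", posterior_median(m))
    print("P(n > 2m | m):", prob_exceeds(2 * m, m), "versus m/n0 =", 0.5)
```

For $m = 100$ the median lands very close to $2m$, in line with the approximation $m/n_0$.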

6. CHAPTER V: SIGNIFICANCE TESTS: ONE 
NEW PARAMETER 

The essential feature is that we express ignorance 
of whether the new parameter is needed by taking 
half the prior probability for it as concentrated in 
the value indicated by the null hypothesis and 
distributing the other half over the range possible. 
H. Jeffreys, Theory of Probability, Section 5.0. 

This chapter (as well as the following one) is concerned with the central issue of testing hypotheses, the title expressing a focus on the specific case of point null hypotheses: Is the new parameter supported by the observations, or is any variation expressible by it better interpreted as random?^ The construction of Bayes factors as natural tools for answering such questions does require more mathematical rigor when dealing with improper priors than what is found in Theory of Probability. Even though it can be argued that Jeffreys's solution (using only improper priors on nuisance parameters) is acceptable via a limiting argument (see also Berger, Pericchi and Varshavsky, 1998, for arguments based on group invariance), the specific and delicate feature of using infinite mass measures would deserve more validation than what is found there. The discussion on the choice of priors to use for the parameters of interest is, however, more rewarding since Jeffreys realizes that (point estimation) Jeffreys's priors cannot be used in this setting (because of their improperness) and that an alternative class of (testing) Jeffreys's priors needs to be introduced. It seems to us that this second type of Jeffreys's priors has been overlooked in the subsequent literature, even though the specific case of the Cauchy prior is often pointed out as a reference prior for testing point null hypotheses involving location parameters.



^For an example of a constant MAP estimator, see Robert (2001, Example 4.2).

^The formulation of the question restricts the test to embedded hypotheses, even though Section 5.7 deals with normality tests.
6.1 Model Choice Formalism 

Jeffreys starts by analyzing the question, 

In what circumstances do observations sup- 
port a change of the form of the law it- 
self?, 

from a model-choice perspective, by assigning prior probabilities to the models $\mathfrak{M}_i$ that are in competition, $\pi(\mathfrak{M}_i)$ ($i = 1,2,\ldots$). He further constrains those probabilities to be terms of a convergent series.^ When checking back in Chapter I (Section 1.62), it appears that this condition is due to the constraint that the probabilities can be normalized to 1, which sounds like an unnecessary condition if dealing with improper priors at the same time.^ The consequence of this constraint is that $\pi(\mathfrak{M}_i)$ must decrease like $2^{-i}$ or $i^{-2}$ and it thus (a) prevents the use of equal probabilities advocated before and (b) imposes an ordering of models.

Obviously, the use of the Bayes factor eliminates 
the impact of this choice of prior probabilities, as 
it does for the decomposition of an alternative hy- 
pothesis Hi into a series of mutually irrelevant alter- 
native hypotheses. The fact that m alternatives are 
tested at once induces a Bonferroni effect, though, 
that is not (correctly) taken into account at the be- 
ginning of Section 5.04 (even if Jeffreys notes that 
the Bayes factor is then multiplied by 0.7m). The 
following discussion borders more on "ranking and 
selection" than on testing per se, although the use 
of Bayes factors with correction factor m or is 
the proposed solution. It is only at the end of Sec- 
tion 5.04 that the Bonferroni effect of repeated test- 
ing is properly recognized, if not correctly solved 
from a Bayesian point of view. 

If the hypothesis to be tested is $H_0:\theta = 0$, against the alternative $H_1$ that is the aggregate of other possible values [of $\theta$], Jeffreys initiates one of the major advances of Theory of Probability by rewriting the prior distribution as a mixture of a point mass in $\theta = 0$ and of a generic density $\pi$ on the range of $\theta$,

$$\pi(\theta) = \tfrac{1}{2}\,\mathbb{I}_0(\theta) + \tfrac{1}{2}\,\pi(\theta).$$



^The perspective of an infinite sequence of models under comparison is not pursued further in this chapter.

^In Jeffreys (1931), Jeffreys puts forward a similar argument that it is impossible to construct a theory of quantitative inference on the hypothesis that all general laws have the same prior probability (Section 4.3, page 43). See Earman (1992) for a deeper discussion of this point.



This is indeed a stepping stone for Bayesian Statistics in that it explicitly recognizes the need to separate the null hypothesis from the alternative hypothesis within the prior, lest the null hypothesis is not properly weighted once it is accepted. The overall principle is illustrated for a normal setting, $x \sim \mathcal{N}(\theta,\sigma^2)$ (with known $\sigma^2$), so that the Bayes factor is

$$K = \frac{\pi(H_0|x)}{\pi(H_1|x)}\Big/\frac{\pi(H_0)}{\pi(H_1)} = \frac{\exp\{-x^2/2\sigma^2\}}{\int f(\theta)\exp\{-(x-\theta)^2/2\sigma^2\}\,d\theta},$$

where $f$ denotes the prior density of $\theta$ under $H_1$.

The numerical calibration of the Bayes factor is not 
directly addressed in the main text, except via a 
qualitative divergence from the neutral K = 1. Ap- 
pendix B provides a grading of the Bayes factor, as 
follows: 

• Grade 0. $K > 1$. Null hypothesis supported. 

• Grade 1. $1 > K > 10^{-1/2}$. Evidence against $H_0$, but not worth more than a bare mention. 

• Grade 2. $10^{-1/2} > K > 10^{-1}$. Evidence against $H_0$ substantial. 

• Grade 3. $10^{-1} > K > 10^{-3/2}$. Evidence against $H_0$ strong. 

• Grade 4. $10^{-3/2} > K > 10^{-2}$. Evidence against $H_0$ very strong. 

• Grade 5. $10^{-2} > K$. Evidence against $H_0$ decisive. 

The comparison with the $\chi^2$ and $t$ statistics in this appendix shows that a given value of $K$ leads to an increasing (in $n$) value of those statistics, in agreement with Lindley's paradox (see Section 6.3 below). 
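The grading of Appendix B listed above amounts to a lookup on $K$. The sketch below is only an illustration of that table; the function name is hypothetical and the handling of values falling exactly on a boundary (assigned here to the stronger grade) is an assumption, since the grading uses strict inequalities on both sides.

```python
def jeffreys_grade(k):
    """Return (grade, description) for a Bayes factor k in favour of H0."""
    if k <= 0:
        raise ValueError("a Bayes factor must be positive")
    # (lower bound of the grade, grade number, verbal description)
    scale = [
        (1.0, 0, "null hypothesis supported"),
        (10 ** -0.5, 1, "evidence against H0, but not worth more than a bare mention"),
        (10 ** -1.0, 2, "evidence against H0 substantial"),
        (10 ** -1.5, 3, "evidence against H0 strong"),
        (10 ** -2.0, 4, "evidence against H0 very strong"),
    ]
    for lower, grade, label in scale:
        if k > lower:
            return grade, label
    return 5, "evidence against H0 decisive"


if __name__ == "__main__":
    for value in (2.0, 0.5, 0.2, 0.05, 0.02, 0.001):
        print(value, jeffreys_grade(value))
```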

If there are nuisance parameters $\xi$ in the model (Section 5.01), Jeffreys suggests using the same prior on $\xi$ under both alternatives, $\pi_0(\xi)$, resulting in the general Bayes factor

$$K = \int \pi_0(\xi) f(x|\xi,0)\,d\xi\Big/\int\!\!\int \pi_0(\xi)\pi_1(\theta|\xi) f(x|\xi,\theta)\,d\xi\,d\theta,$$

where $\pi_1(\theta|\xi)$ is a conditional density. Note that Jeffreys uses a normal model with Laplace's approximation to end up with the approximation

$$K \approx \frac{1}{\pi_1(\hat\theta|\hat\xi)}\sqrt{\frac{n g_{\theta\theta}}{2\pi}}\exp\left\{-\frac{1}{2} n g_{\theta\theta}\hat\theta^2\right\},$$

where $\hat\theta$ and $\hat\xi$ are the MLEs of $\theta$ and $\xi$, and where $g_{\theta\theta}$ is the component of the information matrix corresponding to $\theta$ (under the assumption of strong orthogonality between $\theta$ and $\xi$, which means that the MLE of $\xi$ is identical in both situations). The low impact of the choice of $\pi_0$ on the Bayes factor may be interpreted as a licence to use improper priors on the nuisance parameters despite difficulties with this approach (DeGroot, 1973). An interesting feature of this proposal is that the nuisance parameters are processed independently under both alternatives/models but with the same prior, with the consequence that it makes little difference to $K$ whether we have much or little information about $\xi$.^ When the nuisance parameters and the parameter of interest are not orthogonal, the MLEs $\hat\xi_0$ and $\hat\xi_1$ differ and the approximation of the Bayes factor is now

$$K \approx \frac{\pi_0(\hat\xi_0)}{\pi_0(\hat\xi_1)}\,\frac{1}{\pi_1(\hat\theta|\hat\xi_1)}\sqrt{\frac{n g_{\theta\theta}}{2\pi}}\exp\left\{-\frac{1}{2} n g_{\theta\theta}\hat\theta^2\right\},$$

which shows that the choice of $\pi_0$ may have an influence too.

6.2 Prior Modeling 

In Section 5.02 Jeffreys perceives the difficulty in using an improper prior on the parameter of interest $\theta$ as a normalization problem. If one picks $\pi(\theta)$ or $\pi_1(\theta|\xi)$ as a $\sigma$-finite measure, the Bayes factor $K$ is undefined (rather than always infinite, as put forward by Jeffreys when normalizing by $\infty$). He thus imposes $\pi(\theta)$ to be of any form whose integral converges (to 1, presumably), ending up in the location case^ suggesting a Cauchy $\mathcal{C}(0,\sigma^2)$ prior as $\pi(\theta)$.

^The requirement that $\xi' = \xi$ when $\theta = 0$ (where $\xi'$ denotes the nuisance parameter under $H_1$) seems at first meaningless, since each model is processed independently, but it could signify that the parameterization of both models must be the same when $\theta = 0$. Otherwise, assuming that some parameters are the same under both models is a source of contention within the Bayesian literature.

^Note that the section seems to consider only location parameters.

The first example fully processed in this chapter is the innocuous $\mathcal{B}(n,p)$ model with $H_0:p = p_0$, which leads to the Bayes factor

$$K = \frac{(n+1)!}{x!\,(n-x)!}\,p_0^x(1-p_0)^{n-x}$$

under the uniform prior. While $K = 1$ is recognized as a neutral value, no scaling or calibration of $K$ is mentioned at this stage for reaching a decision about $H_0$ when looking at $K$. The only comment worth noting there is that $K$ is not very decisive for small values of $n$: we cannot get decisive results one way or the other from a small sample (without adopting a decision framework). The next example still sticks to a compact parameter space, since it deals with the $2\times2$ contingency table. The null hypothesis $H_0$ is that of independence between both factors, $H_0:p_{11}p_{22} = p_{12}p_{21}$. The reparameterization in terms of the margins is

            1                              2
    1   $\alpha\beta+\gamma$           $\alpha(1-\beta)-\gamma$
    2   $(1-\alpha)\beta-\gamma$       $(1-\alpha)(1-\beta)+\gamma$





1 




1 




a(i-/i)-7 


2 


(l-a)/3-7 


(l-a)(l-/3)+7 



but, in order to simplify the constraint

$$-\min\{\alpha\beta,(1-\alpha)(1-\beta)\} \le \gamma \le \min\{\alpha(1-\beta),(1-\alpha)\beta\},$$

Jeffreys then assumes that $\alpha \le \beta \le 1/2$ via a mere rearrangement of the table. In this case, $\pi(\gamma|\alpha,\beta) = 1/\alpha$ over $(-\alpha\beta,\alpha(1-\beta))$. Unfortunately, this assumption (of being able to rearrange) is not realistic when $\alpha$ and $\beta$ are unknown and, while the author notes that in ranges where $\alpha$ is not the smallest, it must be replaced in the denominator [of $\pi(\gamma|\alpha,\beta)$] by the smallest, the subsequent derivation keeps using the constraint $\alpha \le \beta \le 1/2$ and the denominator $\alpha$ in the conditional distribution of $\gamma$, acknowledging later that an approximation has been made in allowing $\alpha$ to range from 0 to 1 since $\alpha \le \beta \le 1/2$. Obviously, the motivation behind this crude approximation is to facilitate the computation of the Bayes factor,^ as

$$K \approx \frac{(n_{1\cdot}+1)!\,n_{2\cdot}!\,n_{\cdot1}!\,n_{\cdot2}!}{n_{11}!\,n_{22}!\,n_{12}!\,n_{21}!\,(n+1)!}\,(n+1)$$

if the data is

            1              2
    1   $n_{11}$       $n_{12}$       $n_{1\cdot}$
    2   $n_{21}$       $n_{22}$       $n_{2\cdot}$
        $n_{\cdot1}$   $n_{\cdot2}$   $n$

The computation of the (true) marginal associated with this prior (under $H_1$) is indeed involved and requires either formal or numerical machine-based integration. For instance, massively simulating from the prior is sufficient to provide this approximation. As shown by Figure 3, the difference between the Monte Carlo approximation and Jeffreys's approximation is not spectacular, even though Jeffreys's approximation appears to be always biased toward larger values, that is, toward the null hypothesis, especially for the values of $K$ larger than 1. In some occurrences, the bias is such that it means acceptance versus rejection, depending on which version of $K$ is used.

Fig. 3. Comparison of a Monte Carlo approximation to the Bayes factor for the 2x2 contingency table with Jeffreys's approximation, based on 10'^ randomly generated 2x2 tables and 10* generations from the prior.

^Notice the asymmetry in $n_{1\cdot}$ resulting from the approximation.

However, if one uses instead a Dirichlet $\mathcal{D}(1,1,1,1)$ prior on the original parameterization $(p_{11},\ldots,p_{22})$, the marginal is (up to the multinomial coefficient) the Dirichlet normalizing constant^ 



7ni(n) oc 



L>(nii-H,...,n22 + 1) 
15(1,1,1,1) 

(n + 3)! 



3! 



nii!n22!ni2!n2i! ' 



so the (true) Bayes factor in this case is 
ni.!n2.!n.i!n.2! 3!(n-|-3)! 



K 



((n-|-l)!)2 nii!n22!rai2!n2i! 
ni.!n2.!n.i!n.2! 3!(n -|- 3)(n -|- 2) 
nii!n22!ni2!n2i! (n-|-l)! 



^Note that using a Haldane (improper) prior is impossible in this case, since the normalizing constant cannot be eliminated. 



which is larger than Jeffreys's approximation. A ver- 
sion much closer to Jeffreys's modeling is based on 
the parameterization 



1 

"aJT 



{l-a)P 



(l-7)(l-/3) 



in which case a, /3 and 7 are not constrained by one 
another and a uniform prior on the three parameters 
can be proposed. After straightforward calculations, 
the Bayes factor is given by 



K={n + 1 



n.i!n.2!(ni. -H)!(n2. -H)! 

' (^ + 1: 



!nii!ni2!n2i!n22! 



which is very similar to Jeffreys's approximation since the ratio is $(n_{2\cdot}+1)/(n+1)$. Note that the alternative parameterization based on using 





1 


2 


1 






2 


(l-a)(l-/3) 


(l-a)(l-7) 



with a uniform prior provides a different answer (with $n_{i\cdot}$'s and $n_{\cdot j}$'s being inverted in $K$). Section 5.12 reprocesses the contingency table with one fixed margin, obtaining very similar outcomes.^ 

^An interesting example of statistical linguistics is processed in Section 5.14, with the comparison of genders in Welsh, Latin and German, with Freund's psychoanalytic symbols, whatever that means!, but the fact that both Latin and German have neuters complicated the analysis so much for Jeffreys that he did without the neuters, apparently unable to deal with $3\times2$ tables.

In the case of the comparison of two Poisson samples (Section 5.15), $\mathcal{P}(\lambda)$ and $\mathcal{P}(\lambda')$, the null hypothesis is $H_0:\lambda/\lambda' = a/(1-a)$, with $a$ fixed. This suggests the reparameterization

$$\lambda = \alpha\beta, \qquad \lambda' = (1-\alpha)\beta,$$

with $H_0:\alpha = a$. This reparameterization appears to be strongly orthogonal in that

$$\begin{aligned}
K &= \frac{\int \pi(\beta)\,a^x(1-a)^{x'}\beta^{x+x'}e^{-\beta}\,d\beta}{\int\!\!\int \pi(\beta)\,\alpha^x(1-\alpha)^{x'}\beta^{x+x'}e^{-\beta}\,d\beta\,d\alpha} \\
&= \frac{a^x(1-a)^{x'}\int \pi(\beta)\beta^{x+x'}e^{-\beta}\,d\beta}{\int \alpha^x(1-\alpha)^{x'}\,d\alpha\,\int \pi(\beta)\beta^{x+x'}e^{-\beta}\,d\beta} \\
&= \frac{a^x(1-a)^{x'}}{\int \alpha^x(1-\alpha)^{x'}\,d\alpha} = \frac{(x+x'+1)!}{x!\,x'!}\,a^x(1-a)^{x'} \qquad (1)
\end{aligned}$$

for every prior $\pi(\beta)$, a rather unusual invariance property! Note that, as shown by (1), it also corresponds to the Bayes factor for the distribution of $x$ conditional on $x + x'$ since this is a binomial $\mathcal{B}(x+x',\alpha)$ distribution. The generalization to the Poisson case is therefore marginal since it still focuses on a compact parameter space.
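The coincidence between the Poisson comparison and the binomial test can be illustrated numerically. This is a minimal sketch under the uniform prior on $\alpha$ used in (1); the function names are hypothetical and the midpoint rule is only there to check the closed form of the Beta integral.

```python
import math


def binomial_bayes_factor(x, n, p0):
    """K for H0: p = p0 against a uniform prior on p, with x successes in n trials."""
    return (n + 1) * math.comb(n, x) * p0**x * (1.0 - p0) ** (n - x)


def poisson_ratio_bayes_factor(x, x_prime, a):
    """K of equation (1): H0 fixes lambda / lambda' = a / (1 - a)."""
    return binomial_bayes_factor(x, x + x_prime, a)


def poisson_ratio_bayes_factor_numeric(x, x_prime, a, steps=100_000):
    """Same K, with the integral over alpha evaluated by a midpoint rule."""
    h = 1.0 / steps
    integral = h * math.fsum(
        ((i + 0.5) * h) ** x * (1.0 - (i + 0.5) * h) ** x_prime for i in range(steps)
    )
    return a**x * (1.0 - a) ** x_prime / integral


if __name__ == "__main__":
    print(poisson_ratio_bayes_factor(3, 7, 0.5))
    print(poisson_ratio_bayes_factor_numeric(3, 7, 0.5))
```

Because the prior on $\beta$ cancels in (1), no choice of $\pi(\beta)$ appears anywhere in the code.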

6.3 Improper Priors Enter 

The bulk of this chapter is dedicated to testing problems connected with the normal distribution. It offers an interesting insight into Jeffreys's processing of improper priors, in that both the infinite mass and the lack of normalizing constant are not clearly signaled as potential problems in the book.

In the original problem of testing the nullity of a normal mean, when $x_1,\ldots,x_n \sim \mathcal{N}(\mu,\sigma^2)$, Jeffreys uses a reference prior $\pi_0(\sigma) \propto \sigma^{-1}$ under the null hypothesis and the same reference prior augmented by a proper prior on $\mu$ under the alternative,

$$\pi_1(\mu,\sigma) \propto \frac{1}{\sigma}\,\pi_{11}(\mu/\sigma)\,\frac{1}{\sigma},$$

where $\sigma$ is used as a scale for $\mu$. The Bayes factor is then defined as

$$K = \int_0^\infty \sigma^{-n-1}\exp\left\{-\frac{n}{2\sigma^2}(\bar x^2+s^2)\right\}d\sigma\Big/\int_0^\infty\!\!\int_{-\infty}^{+\infty}\pi_{11}(\mu/\sigma)\,\sigma^{-n-2}\exp\left\{-\frac{n}{2\sigma^2}\left[(\bar x-\mu)^2+s^2\right]\right\}d\sigma\,d\mu$$

without any remark on the use of an improper prior in both the numerator and the denominator.^ There is therefore no discussion about the point of using an improper prior on the nuisance parameters present in both models, that has been defended later in, for example, Berger, Pericchi and Varshavsky (1998) with deeper arguments. The focus is rather on a reference choice for the proper prior $\pi_{11}$. Jeffreys notes that, if $\pi_{11}$ is even, $K = 1$ when $n = 1$, and he forces the Bayes factor to be zero when $s^2 = 0$ and $\bar x \ne 0$, by a limiting argument that a null empirical variance implies that $\sigma = 0$ and thus that $\mu = \bar x \ne 0$. This constraint is equivalent to the denominator of $K$ diverging, that is,

$$\int f(v)\,v^{n-1}\,dv = \infty,$$

where $f$ stands for the density $\pi_{11}$. A solution^ that works for all $n \ge 2$ is the Cauchy density, $f(v) = 1/\pi(1+v^2)$, advocated as such^ a reference prior by Jeffreys (while he criticizes the potential use of this distribution for actual data).

^If we extrapolate from earlier remarks by Jeffreys, his justification may be that the same normalizing constant (whether or not it is finite) is used in both the numerator and the denominator.

While the numerator of $K$ is available in closed form,

$$\int_0^\infty \sigma^{-n-1}\exp\left\{-\frac{n}{2\sigma^2}(\bar x^2+s^2)\right\}d\sigma = \frac{1}{2}\left\{\frac{n}{2}(\bar x^2+s^2)\right\}^{-n/2}\Gamma(n/2),$$

this is not the case for the denominator and Jeffreys studies in Section 5.2 some approximations to the Bayes factor, the simplest^ being

$$K \approx \sqrt{\pi\nu/2}\,(1+t^2/\nu)^{-(\nu+1)/2},$$

where $\nu = n-1$ and $t = \sqrt{\nu}\,\bar x/s$ (which is the standard $t$ statistic with a constant distribution over $\nu$ under the null hypothesis). Although Jeffreys does not explicitly delve into this direction, this approximation of the Bayes factor is sufficient to expose Lindley's paradox (Lindley, 1957), namely, that the Bayes factor $K$, being equivalent to $\sqrt{\pi\nu/2}\exp\{-t^2/2\}$, goes to $\infty$ with $\nu$ for a fixed value of $t$, thus highlighting the increasing discrepancy between the frequentist and the Bayesian analyses of this testing problem (Berger and Sellke, 1987). As pointed out to us by Lindley (private communication), the paradox is sometimes called the Lindley-Jeffreys paradox, because this section clearly indicates that $t$ increases like $(\log\nu)^{1/2}$ to keep $K$ constant.
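The growth of $K$ with $\nu$ at a fixed $t$ is easy to tabulate from the $t$ approximation quoted above. The sketch below is illustrative only; the function names are hypothetical, and the second function inverts the large-$\nu$ form $\sqrt{\pi\nu/2}\exp\{-t^2/2\}$ rather than the exact approximation.

```python
import math


def jeffreys_t_approximation(t, nu):
    """K approximated by sqrt(pi * nu / 2) * (1 + t^2 / nu)^(-(nu + 1) / 2)."""
    return math.sqrt(math.pi * nu / 2.0) * (1.0 + t * t / nu) ** (-(nu + 1) / 2.0)


def t_for_constant_k(k, nu):
    """Positive t at which sqrt(pi * nu / 2) * exp(-t^2 / 2) equals k.

    Requires sqrt(pi * nu / 2) > k, otherwise no such t exists.
    """
    return math.sqrt(2.0 * math.log(math.sqrt(math.pi * nu / 2.0) / k))


if __name__ == "__main__":
    for nu in (10, 100, 1_000, 10_000):
        print(nu, jeffreys_t_approximation(1.96, nu), t_for_constant_k(1.0, nu))
```

With $t$ held at 1.96, a value usually read as borderline significant, the printed Bayes factors increase with $\nu$, while the $t$ needed to keep $K = 1$ grows slowly, on the $(\log\nu)^{1/2}$ scale.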

The correct Bayes factor can of course be approximated by a Monte Carlo experiment, using, for instance, samples generated as

$$\sigma^{-2} \sim \mathcal{G}a\left(\frac{n+1}{2},\frac{ns^2}{2}\right) \quad\text{and}\quad \mu|\sigma \sim \mathcal{N}(\bar x,\sigma^2/n).$$

^There are obviously many other distributions that also satisfy this constraint. The main drawback of the Cauchy proposal is nonetheless that the scale of 1 is arbitrary, while it clearly has an impact on posterior results.

^Cauchy random variables occur in practice as ratios of normal random variables, so they are not completely implausible.

^The closest to an explicit formula is obtained just before Section 5.21 as a representation of $K$ through a single integral involving a confluent hypergeometric function.

Fig. 4. Comparison of a Monte Carlo approximation to the Bayes factor for the normal mean problem with Jeffreys's approximation, based on 5 X 10"^ randomly generated normal sufficient statistics with n = 10 and 10 Monte Carlo simulations of $(\mu,\sigma)$.

The difference between the $t$ approximation and the true value of the Bayes factor can be fairly important, as shown on Figure 4 for $n = 10$. As in Figure 3, the bias is always in the same direction, the approximation penalizing $H_0$ this time. Obviously, as $n$ increases, the discrepancy decreases. (The upper truncation on the cloud is a consequence of Jeffreys's approximation being bounded by $\sqrt{\pi\nu/2}$.)
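A Monte Carlo evaluation along these lines can be sketched with the standard library alone. The version below is an illustration under assumptions made here: it draws $\sigma^{-2}$ from a gamma distribution with shape $n/2$ (not the shape $(n+1)/2$ quoted above) so that the draws follow the prior-free part of the denominator exactly and no importance weights are needed; it then uses the identity $K = (1+\bar x^2/s^2)^{-n/2}\sqrt{n/2\pi}\,/\,E[f(\mu/\sigma)]$, which follows from the closed-form numerator. Function names are hypothetical.

```python
import math
import random


def cauchy_density(v):
    """Standard Cauchy density, the proper prior f on mu / sigma."""
    return 1.0 / (math.pi * (1.0 + v * v))


def bayes_factor_monte_carlo(xbar, s2, n, draws=50_000, seed=1):
    """Monte Carlo value of K for H0: mu = 0 with a Cauchy prior on mu / sigma."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        # 1 / sigma^2 ~ Gamma(shape n/2, rate n*s2/2); gammavariate takes a scale.
        precision = rng.gammavariate(n / 2.0, 2.0 / (n * s2))
        sigma = 1.0 / math.sqrt(precision)
        mu = rng.gauss(xbar, sigma / math.sqrt(n))
        total += cauchy_density(mu / sigma)
    mean_prior = total / draws
    return (1.0 + xbar**2 / s2) ** (-n / 2.0) * math.sqrt(n / (2.0 * math.pi)) / mean_prior


def bayes_factor_t_approximation(xbar, s2, n):
    """Jeffreys's t approximation with nu = n - 1 and t = sqrt(nu) * xbar / s."""
    nu = n - 1
    t2 = nu * xbar**2 / s2
    return math.sqrt(math.pi * nu / 2.0) * (1.0 + t2 / nu) ** (-(nu + 1) / 2.0)


if __name__ == "__main__":
    for xbar in (0.0, 0.5, 1.0):
        print(xbar, bayes_factor_monte_carlo(xbar, 1.0, 10),
              bayes_factor_t_approximation(xbar, 1.0, 10))
```

Since the Cauchy density never exceeds $1/\pi$, the simulated value always sits above the $t$ approximation in this sketch, which agrees with the direction of the bias described in the text.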

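The Cauchy prior on the mean is also a computational hindrance when $\sigma$ is known: the Bayes factor is then

$$K = \exp\{-n\bar x^2/2\sigma^2\}\Big/\frac{1}{\pi\sigma}\int_{-\infty}^{+\infty}\exp\left\{-\frac{n}{2\sigma^2}(\bar x-\mu)^2\right\}\frac{d\mu}{1+\mu^2/\sigma^2}.$$

In this case, Jeffreys proposes the approximation

$$K \approx \sqrt{\frac{\pi n}{2}}\,(1+\bar x^2/\sigma^2)\exp\{-n\bar x^2/2\sigma^2\},$$

which is then much more accurate, as shown by Figure 5: the maximum ratio between the approximated $K$ and the value obtained by simulation is 1.15 for $n = 5$ and the difference furthermore decreases as $n$ increases.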
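With $\sigma$ known the denominator is a one-dimensional integral, so the comparison can also be made by quadrature instead of simulation. The following is a rough sketch, assuming a Cauchy prior on $\mu$ with scale $\sigma$ and a Laplace-type approximation of the form $\sqrt{\pi n/2}\,(1+\bar x^2/\sigma^2)\exp\{-n\bar x^2/2\sigma^2\}$; function names and the integration window are choices made here.

```python
import math


def bayes_factor_known_sigma(xbar, sigma, n, steps=20_000, half_width=10.0):
    """K by a midpoint rule over mu, within +/- half_width likelihood standard deviations."""
    spread = half_width * sigma / math.sqrt(n)
    lower = xbar - spread
    h = 2.0 * spread / steps
    integral = 0.0
    for i in range(steps):
        mu = lower + (i + 0.5) * h
        likelihood = math.exp(-n * (xbar - mu) ** 2 / (2.0 * sigma**2))
        prior = 1.0 / (math.pi * sigma * (1.0 + (mu / sigma) ** 2))
        integral += likelihood * prior * h
    return math.exp(-n * xbar**2 / (2.0 * sigma**2)) / integral


def bayes_factor_known_sigma_approx(xbar, sigma, n):
    """Laplace-type approximation: the Cauchy prior is frozen at mu = xbar."""
    z2 = (xbar / sigma) ** 2
    return math.sqrt(math.pi * n / 2.0) * (1.0 + z2) * math.exp(-n * z2 / 2.0)


if __name__ == "__main__":
    for xbar in (0.0, 0.5, 1.0):
        exact = bayes_factor_known_sigma(xbar, 1.0, 5)
        approx = bayes_factor_known_sigma_approx(xbar, 1.0, 5)
        print(xbar, exact, approx, exact / approx)
```

At $\bar x = 0$ and $n = 5$ the two values differ by a factor close to the 1.15 reported above.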

6.4 A Second Type of Jeffreys Priors 

In Section 5.3 Jeffreys makes another general pro- 
posal for the selection of proper priors under the 
alternative hypothesis: Noticing that the Kullback 
divergence is = /U^/o"^ in the normal case 




Fig. 5. Monte Carlo approximation to the Bayes factor for 
the normal mean problem with known variance, compared with 
Jeffreys's approximation, based on 10® Monte Carlo simula- 
tions of ^, when n = 5. 

above, he deduces that the Cauchy prior he proposed 
on /i is equivalent to a flat prior on arctan J^/^; 

da 1 I 
7 o , ON = = - 4tan~^ J^'^(a)}, 

and turns this coincidence into a general rule.^^ Jn 
particular, the change of variable from /x to J is 
not one-to-one, so there is some technical difficulty 
linked with this proposal: Jeffreys argues that J^^^ 
should be taken to have the same sign as // but this 
is not satisfactory nor applicable in general settings. 
Obviously, the symmetrization will not always be 
possible and correcting when the inverse tangents 
do not range from —tt/2 to n/2 can be done in many 
ways, thus making the idea not fully compatible 
with the general invariance principle at the core of 
Theory of Probability. Note, however, that Jeffreys's 
idea of using a functional of the Kullback-Leibler 
divergence (or of other divergences) as a reference 
parameterisation for the new parameter has many 
interesting applications. For instance, it is central to 
the locally conic parameterization used by Dacunha- 
Castelle and Gassiat (1999) for testing the number 
of components in mixture models. 

In the first case he examines, namely, the case of 
the contingency table, Jeffreys finds that the cor- 
responding Kullback divergence depends on which 



*^We were not aware of this rule prior to reading the book 
and this second type of Jeffreys's priors, judging from the 
Bayesian literature, does not seem to have inspired many fol- 
lowers. 
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Fig. 6. Jeffreys's reference density on log((T/(To) for the test 
of Ho -.0 = 00. 




Fig. 7. Ratio of a Monte Carlo approximation to the Bayes
factor for the normal variance problem and of Jeffreys's
approximation, when n = 10 (based on 10* simulations).

margins are fixed (as is well known, the Fisher information
matrix is not fully compatible with the Likelihood Principle,
see Berger and Wolpert, 1988). Nonetheless, this is an
interesting insight that precedes the reference priors of
Bernardo (1979): given nuisance parameters, it derives the
(conditional) prior on the parameter of interest as the
Jeffreys prior for the conditional information. See Bayarri
and Garcia-Donato (2007) for a modern extension of this
perspective to general testing problems.

In the case (Section 5.43) of testing whether a [normal]
standard error has a suggested value $\sigma_0$ when observing
$ns^2 \sim \mathcal{G}a(n/2, \sigma^2/2)$, the parameterization
$\sigma = \sigma_0 e^{\zeta}$ leads to (modulo the improper change
of variables) $J(\zeta) = 2\sinh^2(\zeta)$ and

$$\frac{1}{\pi}\,\frac{d\tan^{-1}J^{1/2}(\zeta)}{d\zeta} = \frac{\sqrt{2}\cosh(\zeta)}{\pi\cosh(2\zeta)}$$

as a potential (and overlooked) prior on $\zeta = \log(\sigma/\sigma_0)$.^{13}
The corresponding Bayes factor is not available in closed form since
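$$\int_{-\infty}^{\infty} \frac{\cosh(\zeta)}{\cosh(2\zeta)}\, e^{-n\zeta} \exp\{-ns^2/2\sigma^2\}\,d\zeta = \int_0^{\infty} \frac{1+u^2}{1+u^4}\, u^n \exp\left\{-\frac{ns^2}{2\sigma_0^2}\,u^2\right\} du$$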

cannot be analytically integrated, even though a 
Monte Carlo approximation is readily computed. Figure 7
shows that Jeffreys's approximation,

$$K \approx \sqrt{\frac{\pi n}{2}}\;\frac{\cosh(2\log s/\sigma_0)}{\cosh(\log s/\sigma_0)}\;(s/\sigma_0)^n \exp\{n(1-(s/\sigma_0)^2)/2\},$$



is again fairly accurate since the ratio is at worst 
0.9 for n = 10 and the difference decreases as n in- 
creases. 
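
As an illustration, the Monte Carlo evaluation mentioned above can be sketched in a few lines of Python. This is a minimal sketch and not the code used for Figure 7: it assumes the prior $\sqrt{2}\cosh(\zeta)/\{\pi\cosh(2\zeta)\}$ on $\zeta$, simulated by inverting its cdf $1/2 + \tan^{-1}(\sqrt{2}\sinh\zeta)/\pi$, and it writes $t = s/\sigma_0$. The function names, the seed and the default number of draws are our own choices.

```python
import math
import random


def draw_zeta(rng):
    """Draw zeta from the density sqrt(2) cosh(z) / (pi cosh(2z)) by cdf inversion."""
    u = rng.random()
    return math.asinh(math.tan(math.pi * (u - 0.5)) / math.sqrt(2.0))


def likelihood_ratio(zeta, t, n):
    """L(sigma0 * exp(zeta)) / L(sigma0), where t = s / sigma0."""
    return math.exp(-n * zeta - 0.5 * n * t * t * math.expm1(-2.0 * zeta))


def bayes_factor_mc(t, n, draws=100_000, seed=1):
    """Monte Carlo estimate of K (in favor of the null): 1 / E_prior[likelihood ratio]."""
    rng = random.Random(seed)
    total = sum(likelihood_ratio(draw_zeta(rng), t, n) for _ in range(draws))
    return draws / total


def jeffreys_approximation(t, n):
    """Laplace-type approximation to K quoted in the text."""
    z = math.log(t)
    return (math.sqrt(math.pi * n / 2.0) * math.cosh(2.0 * z) / math.cosh(z)
            * t ** n * math.exp(0.5 * n * (1.0 - t * t)))


if __name__ == "__main__":
    for t in (0.7, 1.0, 1.5):
        print(t, bayes_factor_mc(t, 10) / jeffreys_approximation(t, 10))
```

Simulating from the prior keeps the sketch short; an importance function centered at $\log(s/\sigma_0)$ would reduce the variance when $s$ is far from $\sigma_0$.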

The special case of testing a normal correlation coefficient,
$H_0: \rho = \rho_0$, is not processed (in Section 5.5) via this
general approach. Instead, based on arguments connected with
(a) the earlier difficulties in the construction of an
appropriate noninformative prior (Section 4.7) and (b) the fact
that $J$ diverges for the null hypothesis^{14} $\rho = \pm 1$,
Jeffreys falls back on the uniform $\mathcal{U}(-1,1)$ solution,
which is even more convincing in that it leads to an almost
closed-form solution

$$K = \frac{2(1-\rho_0^2)^{n/2}\big/(1-\rho_0\hat\rho)^{n-1/2}}{\int_{-1}^{1}(1-\rho^2)^{n/2}\big/(1-\rho\hat\rho)^{n-1/2}\,d\rho}.$$

Note that Jeffreys's approximation,

$$K \approx \left(\frac{2n-1}{\pi}\right)^{1/2} \frac{(1-\rho_0^2)^{n/2}(1-\hat\rho^2)^{(n-3)/2}}{(1-\rho_0\hat\rho)^{n-1/2}},$$

is quite reasonable in this setting, as shown by Figure 8,
and also that the value of $\rho_0$ has no influence
on the ratios of the approximations. The extension






^{13}Note that this is indeed a probability density, whose
shape is given in Figure 6, despite the loose change of
variables, because a missing 2 cancels with a missing 1/2!

^{14}This choice of the null hypothesis is somewhat unusual,
since, on the one hand, it is more standard to test for no
correlation, that is, $\rho = 0$, and, on the other hand, having
$\rho = \pm 1$ is akin to a unit-root test that, as we know today,
requires firmer theoretical background.
Fig. 8. Ratio of a Monte Carlo approximation to the Bayes
factor for the normal correlation problem and of Jeffreys's
approximation, when n = 10 and $\rho_0 = 0$ (based on 10^
simulations).

to two samples in Section 5.51 (for testing whether 
or not the correlation is the same) is not processed 
in a symmetric way, with some uncertainty about 
the validity of the expression for the Bayes factor: 
a pseudo-common correlation is defined under the
alternative in accordance with the rule that the
parameter $\rho$ must appear in the statement of $H_1$, but
normalizing constraints on $\rho$ are not properly
assessed.
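
The one-sample Bayes factor of Section 5.5 discussed above lends itself to a simple numerical check. The following Python sketch is ours and is not the code behind Figure 8: it assumes the likelihood kernel $(1-\rho^2)^{n/2}/(1-\rho\hat\rho)^{n-1/2}$ and the uniform prior on $(-1,1)$, replaces the Monte Carlo step by a midpoint rule, and uses hypothetical function names. It also illustrates why the ratio to Jeffreys's approximation does not depend on $\rho_0$: the null value enters both quantities through the same factor.

```python
import math


def likelihood_kernel(rho, rho_hat, n):
    """(1 - rho^2)^{n/2} / (1 - rho * rho_hat)^{n - 1/2}."""
    return (1.0 - rho * rho) ** (n / 2.0) / (1.0 - rho * rho_hat) ** (n - 0.5)


def bayes_factor(rho0, rho_hat, n, steps=20_000):
    """K = 2 * kernel(rho0) / integral of the kernel over (-1, 1), midpoint rule."""
    h = 2.0 / steps
    integral = h * sum(likelihood_kernel(-1.0 + (i + 0.5) * h, rho_hat, n)
                       for i in range(steps))
    return 2.0 * likelihood_kernel(rho0, rho_hat, n) / integral


def jeffreys_approximation(rho0, rho_hat, n):
    """Approximation to K quoted in the text."""
    return (math.sqrt((2.0 * n - 1.0) / math.pi)
            * (1.0 - rho0 ** 2) ** (n / 2.0)
            * (1.0 - rho_hat ** 2) ** ((n - 3.0) / 2.0)
            / (1.0 - rho0 * rho_hat) ** (n - 0.5))


if __name__ == "__main__":
    for rho_hat in (0.0, 0.3, 0.6):
        ratio = bayes_factor(0.0, rho_hat, 10) / jeffreys_approximation(0.0, rho_hat, 10)
        print(rho_hat, ratio)
```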

A similar approach is adopted for the compari- 
son of two correlation coefficients, with some quasi- 
hierarchical arguments (see Section 6.5) for the defi- 
nition of the prior under the alternative. Section 5.6 
is devoted to a very specific case of correlation anal- 
ysis that corresponds to our modern random effect 
model. A major part of this section argues in fa- 
vor of the model based on observations in various 
fields, but the connection with the chapter is the 
devising of a test for the presence of those random 
effects. The model is then formalized as normal observations
$\bar x_r \sim \mathcal{N}(\mu, \tau^2 + \sigma^2/k_r)$ ($1 \le r \le m$), where
$k_r$ denotes the number of observations within class
$r$ and $\tau^2$ is the variance of the random effect. The
null hypothesis is therefore $H_0: \tau = 0$. Even at this
stage, the development is not directly relevant, except
for approximation purposes, and the few lines
of discussion about the Bayes factor indicate that
the (testing) Jeffreys prior on $\tau$ should be in $1/\tau^2$



^{15}To be more specific, a normalizing constant $c$ on the
distribution of $\rho_1$ that depends on $\rho$ appears in the closed-form
expression of $K$, as, for instance, in equation (14).



for small $\tau^2$, without further specification. The
(numerical) complexity of the problem may explain why
Jeffreys differs from his usual processing, although
current computational tools obviously allow for a
complete processing (modulo the proper choice of a
prior on $\tau$) (see, e.g., Ghosh and Meeden, 1984).

Jeffreys also advocates using this principle for test- 
ing a normal distribution against alternatives from 
the Pearson family of distributions in Section 5.7, 
but no detail is given as to how J is computed and 
how the Bayes factor is derived. Similarly, for the 
comparison of the Poisson distribution with the neg- 
ative binomial distribution in Section 5.8, the form 
of J is provided for the distance between the two 
distributions, but the corresponding Bayes factor is 
only given via a very crude approximation with no 
mention of the corresponding priors. 

In Section 5.9 the extension of the (regular) model 
to the case of (linear) regression and of variable
selection is briefly considered, noticing that (a) for a
single regressor (Section 5.91), the problem is
exactly equivalent to testing whether or not a normal
mean $\mu$ is equal to 0 and (b) for more than one
regressor (Section 5.92), the test of nullity of one
coefficient can be done conditionally on the others, that
is, they can be treated as nuisance parameters under 
both hypotheses. (The case of linear calibration in 
Section 5.93 is also processed as a by-product.) 

6.5 A Foray into Hierarchical Bayes 

Section 5.4 explores further tests related to the 
normal distribution, but Section 5.41 starts with a 
highly unusual perspective. When testing whether 
or not the means of two normal samples are equal, that is,
$H_0: \mu_1 = \mu_2$, with likelihood $L(\mu_1, \mu_2, \sigma)$
proportional to

$$\sigma^{-n_1-n_2}\exp\left\{-\frac{n_1}{2\sigma^2}(\bar x_1-\mu_1)^2 - \frac{n_2}{2\sigma^2}(\bar x_2-\mu_2)^2 - \frac{n_1 s_1^2+n_2 s_2^2}{2\sigma^2}\right\},$$

Jeffreys also introduces the value of the common mean, $\mu$,
into the alternative. A possible, albeit slightly apocryphal,
interpretation is to consider $\mu$ as a hyperparameter
that appears both under the null and under the
alternative, which is then an incentive to use a single 
improper prior under both hypotheses (once again 
because of the lack of relevance of the corresponding
pseudo-normalizing constant). But there is still
a difficulty with the introduction of three different
alternatives with a hyperparameter $\mu$:

$\mu_1 = \mu$ and $\mu_2 \ne \mu$, $\quad\mu_1 \ne \mu$ and $\mu_2 = \mu$, $\quad\mu_1 \ne \mu$ and $\mu_2 \ne \mu$.

Given that $\mu$ has no intrinsic meaning under the
alternative, the most logical^{16} translation of this
multiplication of alternatives is that the three
formulations lead to three different priors,

$$\pi_{11}(\mu,\mu_1,\mu_2,\sigma) \propto \frac{1}{\pi}\,\frac{1}{\sigma^2+(\mu_2-\mu)^2}\,\mathbb{I}_{\mu_1=\mu},$$

$$\pi_{12}(\mu,\mu_1,\mu_2,\sigma) \propto \frac{1}{\pi}\,\frac{1}{\sigma^2+(\mu_1-\mu)^2}\,\mathbb{I}_{\mu_2=\mu},$$

$$\pi_{13}(\mu,\mu_1,\mu_2,\sigma) \propto \frac{1}{\pi^2}\,\frac{\sigma}{\{\sigma^2+(\mu_1-\mu)^2\}\{\sigma^2+(\mu_2-\mu)^2\}}.$$

When $\pi_{11}$ and $\pi_{12}$ are written in terms of a Dirac
mass, they are clearly identical,

$$\pi_{11}(\mu_1,\mu_2,\sigma) = \pi_{12}(\mu_1,\mu_2,\sigma) \propto \frac{1}{\pi}\,\frac{1}{\sigma^2+(\mu_1-\mu_2)^2}.$$

If we integrate out $\mu$ in $\pi_{13}$, the resulting prior
is

$$\pi_{13}(\mu_1,\mu_2,\sigma) \propto \frac{2}{\pi}\,\frac{1}{4\sigma^2+(\mu_1-\mu_2)^2},$$

whose only difference from $\pi_{11}$ is that the scale in
the Cauchy is twice as large. As noticed later by
Jeffreys, there is little to choose between the alterna- 
tives, even though the third modeling makes more 
sense from a modern, hierarchical point of view: $\mu$
and $\sigma$ denote the location and scale of the
problem, no matter which hypothesis holds, with an
additional parameter $(\mu_1,\mu_2)$ in the case of the
alternative hypothesis. Using a common improper prior
under both hypotheses can then be justified via a 
limiting argument, as in Marin and Robert (2007), 
because those parameters are common to both mod- 
els. Seen as such, the Bayes factor 

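$$\frac{\displaystyle\int \sigma^{-n-1}\exp\left\{-\frac{n_1}{2\sigma^2}(\bar x_1-\mu)^2-\frac{n_2}{2\sigma^2}(\bar x_2-\mu)^2-\frac{n_1s_1^2+n_2s_2^2}{2\sigma^2}\right\}d\sigma\,d\mu}{\displaystyle\int \frac{\sigma^{-n+1}}{\pi^2}\exp\left\{-\frac{n_1}{2\sigma^2}(\bar x_1-\mu_1)^2-\frac{n_2}{2\sigma^2}(\bar x_2-\mu_2)^2-\frac{n_1s_1^2+n_2s_2^2}{2\sigma^2}\right\}\Big/\big(\{\sigma^2+(\mu_1-\mu)^2\}\{\sigma^2+(\mu_2-\mu)^2\}\big)\,d\sigma\,d\mu\,d\mu_1\,d\mu_2}$$

^{16}This does not seem to be Jeffreys's perspective, since he
later (in Sections 5.46 and 5.47) adds up the posterior
probabilities of those three alternatives, effectively dividing the
Bayes factor by 3 or such.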

makes more sense because of the presence of $\sigma$ and $\mu$
on both the numerator and the denominator. While
the numerator can be fully integrated into

$$\sqrt{\pi/2n}\;\Gamma\{(n-1)/2\}\,(ns_0^2/2)^{-(n-1)/2},$$

where $n = n_1+n_2$ and $ns_0^2$ denotes the usual sum of squares,
the denominator



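$$\int \frac{2}{\pi}\,\sigma^{-n}\exp\left\{-\frac{n_1}{2\sigma^2}(\bar x_1-\mu_1)^2-\frac{n_2}{2\sigma^2}(\bar x_2-\mu_2)^2-\frac{n_1s_1^2+n_2s_2^2}{2\sigma^2}\right\}\Big/\big(4\sigma^2+(\mu_1-\mu_2)^2\big)\,d\sigma\,d\mu_1\,d\mu_2$$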

does require numerical or Monte Carlo integration. 
It can actually be written as an expectation under 
the standard noninformative posteriors, 

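$$\sigma^2 \sim \mathcal{IG}\big((n-3)/2,\,(n_1s_1^2+n_2s_2^2)/2\big),$$

$$\mu_1 \sim \mathcal{N}(\bar x_1,\sigma^2/n_1),\qquad \mu_2 \sim \mathcal{N}(\bar x_2,\sigma^2/n_2),$$

of the quantity

$$h(\mu_1,\mu_2,\sigma^2) = \frac{2}{\sqrt{n_1n_2}}\;\frac{\Gamma((n-3)/2)\,\{(n_1s_1^2+n_2s_2^2)/2\}^{-(n-3)/2}}{4\sigma^2+(\mu_1-\mu_2)^2}.$$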



When simulating a range of values of the sufficient 
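statistics $(n_i,\bar x_i,s_i)_{i=1,2}$, the difference between the
Bayes factor and Jeffreys's approximation,

$$K \approx \left(\frac{\pi}{2}\,\frac{n_1n_2}{n_1+n_2}\right)^{1/2}\left(1+\frac{n_1n_2}{n_1+n_2}\,\frac{(\bar x_1-\bar x_2)^2}{n_1s_1^2+n_2s_2^2}\right)^{-(n_1+n_2-1)/2},$$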



is spectacular, as shown in Figure 9. The larger dis- 
crepancy (when compared to earlier figures) can be 
attributed in part to the larger number of sufficient 
statistics involved in this setting. 
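
The representation of the denominator as an expectation suggests a direct Monte Carlo evaluation. The sketch below is a minimal illustration under our reading of the formulas above: the prior is $\pi_{13}$ with $\mu$ integrated out, $n = n_1+n_2$, and $s_i^2$ is the empirical variance with divisor $n_i$. Function names, the number of draws and the seed are arbitrary choices of ours, and this is not the code used for Figure 9.

```python
import math
import random


def log_numerator(n1, n2, xbar1, xbar2, s1sq, s2sq):
    """Log of the closed-form integral under H0 (common mean mu, prior 1/sigma)."""
    n = n1 + n2
    # n * s0^2: pooled sum of squares around the overall empirical mean
    ns0sq = n1 * s1sq + n2 * s2sq + n1 * n2 * (xbar1 - xbar2) ** 2 / n
    return (0.5 * math.log(math.pi / (2.0 * n)) + math.lgamma((n - 1) / 2.0)
            - 0.5 * (n - 1) * math.log(ns0sq / 2.0))


def log_denominator_mc(n1, n2, xbar1, xbar2, s1sq, s2sq, draws=50_000, seed=1):
    """Log of the Monte Carlo estimate of E[h(mu1, mu2, sigma^2)]."""
    rng = random.Random(seed)
    n = n1 + n2
    shape = (n - 3) / 2.0
    rate = (n1 * s1sq + n2 * s2sq) / 2.0
    log_const = (math.log(2.0) - 0.5 * math.log(n1 * n2)
                 + math.lgamma(shape) - shape * math.log(rate))
    acc = 0.0
    for _ in range(draws):
        sigma_sq = rate / rng.gammavariate(shape, 1.0)  # inverse gamma draw
        sd = math.sqrt(sigma_sq)
        mu1 = rng.gauss(xbar1, sd / math.sqrt(n1))
        mu2 = rng.gauss(xbar2, sd / math.sqrt(n2))
        acc += 1.0 / (4.0 * sigma_sq + (mu1 - mu2) ** 2)
    return log_const + math.log(acc / draws)


def bayes_factor(n1, n2, xbar1, xbar2, s1sq, s2sq, **mc_options):
    """Bayes factor in favor of equal means (requires n1 + n2 > 3)."""
    return math.exp(log_numerator(n1, n2, xbar1, xbar2, s1sq, s2sq)
                    - log_denominator_mc(n1, n2, xbar1, xbar2, s1sq, s2sq,
                                         **mc_options))


if __name__ == "__main__":
    for gap in (0.0, 0.5, 1.0, 3.0):
        print(gap, bayes_factor(10, 10, 0.0, gap, 1.0, 1.0))
```

Working on the log scale avoids underflow in the Gamma functions and in the powers of the sums of squares when $n$ is large.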
Fig. 9. Comparison of a Monte Carlo approximation to the
Bayes factor for the normal mean comparison problem and 
of Jeffreys's approximation, corresponding to IQp statistics 
{ni,Xi,Si)i=i^2 and 10 generations from the noninformative 
posterior. 



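A similar split of the alternative is studied in Section 5.42
when the standard deviations are different under both models,
with further simplifications in Jeffreys's approximations to
the posteriors (since the $\mu_i$'s are integrated out). It almost
seems as if $\bar x_1 - \bar x_2$ acts as a pseudo-sufficient
statistic. If we start from a generic representation with
$L(\mu_1,\mu_2,\sigma_1,\sigma_2)$ proportional to

$$\sigma_1^{-n_1}\sigma_2^{-n_2}\exp\left\{-\frac{n_1}{2\sigma_1^2}(\bar x_1-\mu_1)^2-\frac{n_2}{2\sigma_2^2}(\bar x_2-\mu_2)^2-\frac{n_1s_1^2}{2\sigma_1^2}-\frac{n_2s_2^2}{2\sigma_2^2}\right\},$$

and if we use again $\pi(\mu,\sigma_1,\sigma_2) \propto 1/\sigma_1\sigma_2$ under the
null hypothesis and

$$\pi_{11}(\mu_1,\mu_2,\sigma_1,\sigma_2) \propto \frac{1}{\sigma_1\sigma_2}\,\frac{1}{\pi}\,\frac{\sigma_1}{\sigma_1^2+(\mu_2-\mu_1)^2},$$

$$\pi_{12}(\mu_1,\mu_2,\sigma_1,\sigma_2) \propto \frac{1}{\sigma_1\sigma_2}\,\frac{1}{\pi}\,\frac{\sigma_2}{\sigma_2^2+(\mu_2-\mu_1)^2},$$

$$\pi_{13}(\mu,\mu_1,\mu_2,\sigma_1,\sigma_2) \propto \frac{1}{\sigma_1\sigma_2}\,\frac{1}{\pi^2}\,\frac{\sigma_1\sigma_2}{\{\sigma_1^2+(\mu_1-\mu)^2\}\{\sigma_2^2+(\mu_2-\mu)^2\}}$$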

under the alternative, then, as stated in Theory of 
Probability, 



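$$\int \sigma_1^{-n_1-1}\sigma_2^{-n_2-1}\exp\left\{-\frac{n_1}{2\sigma_1^2}(\bar x_1-\mu)^2-\frac{n_2}{2\sigma_2^2}(\bar x_2-\mu)^2-\frac{n_1s_1^2}{2\sigma_1^2}-\frac{n_2s_2^2}{2\sigma_2^2}\right\}d\mu$$

$$= \sqrt{2\pi/(n_2\sigma_1^2+n_1\sigma_2^2)}\;\sigma_1^{-n_1}\sigma_2^{-n_2}\exp\left\{-\frac{(\bar x_1-\bar x_2)^2}{2(\sigma_1^2/n_1+\sigma_2^2/n_2)}-\frac{n_1s_1^2}{2\sigma_1^2}-\frac{n_2s_2^2}{2\sigma_2^2}\right\},$$

but the computation of

$$\int \exp\left\{-\frac{n_1}{2\sigma_1^2}(\bar x_1-\mu_1)^2-\frac{n_2}{2\sigma_2^2}(\bar x_2-\mu_2)^2\right\}\frac{1}{\pi\sigma_1\sigma_2}\,\frac{\sigma_1}{\sigma_1^2+(\mu_1-\mu_2)^2}\,d\mu_1\,d\mu_2$$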



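[and the alternative versions] is not possible in closed
form. We note that $\pi_{13}$ corresponds to a distribution
on the difference $y = \mu_1 - \mu_2$ with density equal to

$$\begin{aligned}
\pi_{13}(\mu_1,\mu_2|\sigma_1,\sigma_2) &= \frac{1}{\pi}\,\frac{(\sigma_1+\sigma_2)(\mu_1-\mu_2)^2+\sigma_1^3-\sigma_1^2\sigma_2-\sigma_1\sigma_2^2+\sigma_2^3}{[(\mu_1-\mu_2)^2+\sigma_1^2+\sigma_2^2]^2-4\sigma_1^2\sigma_2^2}\\
&= \frac{1}{\pi}\,\frac{(\sigma_1+\sigma_2)(y^2+\sigma_1^2-2\sigma_1\sigma_2+\sigma_2^2)}{(y^2+(\sigma_1+\sigma_2)^2)(y^2+(\sigma_1-\sigma_2)^2)}\\
&= \frac{1}{\pi}\,\frac{\sigma_1+\sigma_2}{y^2+(\sigma_1+\sigma_2)^2},
\end{aligned}$$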

thus equal to a Cauchy distribution with scale
$(\sigma_1+\sigma_2)$.^{17} Jeffreys uses instead a Laplace
approximation,

$$\frac{2\sigma_1}{\sqrt{n_1n_2}}\;\frac{1}{\sigma_1^2+(\bar x_1-\bar x_2)^2},$$

to the above integral, with no further justification. 
Given the differences between the three formulations 
of the alternative hypothesis, it makes sense to try 
to compare further those three priors (in our re- 
interpretation as hierarchical priors). As noted by 
Jeffreys, there may be considerable grounds for de- 
cision between the alternative hypotheses. It seems 
to us (based on the Laplace approximations) that 
the most sensible prior is the hierarchical one, $\pi_{13}$,



^{17}While this result follows from the derivation of the
density by integration, a direct proof follows from considering
the characteristic function of the Cauchy distribution $\mathcal{C}(0,\sigma)$,
equal to $\exp(-\sigma|\xi|)$ (see Feller, 1971).
in that the scale depends on both variances rather
than only one. 
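
The Cauchy identity used above can also be checked numerically. The sketch below (ours, with hypothetical function names) integrates the product of the two Cauchy densities $\mathcal{C}(\mu,\sigma_1)$ and $\mathcal{C}(\mu,\sigma_2)$ over $\mu$, after the change of variable $\mu = \mu_1+\sigma_1\tan\theta$ that maps the real line onto $(-\pi/2,\pi/2)$, and compares the result with the $\mathcal{C}(0,\sigma_1+\sigma_2)$ density at $y = \mu_1-\mu_2$.

```python
import math


def marginal_pi13(y, s1, s2, steps=2000):
    """Integral over mu of C(mu1 | mu, s1) * C(mu2 | mu, s2), with y = mu1 - mu2.

    With mu = mu1 + s1 * tan(theta), the first Cauchy density times d(mu)
    becomes d(theta) / pi, and the integrand is smooth and periodic in theta,
    so a plain midpoint rule converges very quickly.
    """
    h = math.pi / steps
    total = 0.0
    for i in range(steps):
        theta = -0.5 * math.pi + (i + 0.5) * h
        c, s = math.cos(theta), math.sin(theta)
        total += s2 * c * c / (s2 * s2 * c * c + (y * c + s1 * s) ** 2)
    return total * h / math.pi ** 2


def cauchy_density(y, scale):
    """Density of the centered Cauchy distribution with the given scale."""
    return scale / (math.pi * (y * y + scale * scale))


if __name__ == "__main__":
    for y, s1, s2 in ((0.7, 1.0, 2.0), (3.0, 0.5, 0.5), (0.0, 2.0, 1.0)):
        print(y, s1, s2, marginal_pi13(y, s1, s2), cauchy_density(y, s1 + s2))
```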

An extension of the test on a (normal) standard 
deviation is considered in Section 5.44 for the agree- 
ment of two estimated standard errors. Once again, 
the most straightforward interpretation of Jeffreys's
derivation is to see it as a hierarchical modeling,
with a reference prior $\pi(\sigma) = 1/\sigma$ on a global scale,
$\sigma_1$ say, and the corresponding (testing) Jeffreys prior
on the ratio $\sigma_1/\sigma_2 = \exp\zeta$. The Bayes factor (in
favor of the null hypothesis) is then given by

$$K^{-1} = \frac{\sqrt{2}}{\pi}\int_{-\infty}^{\infty}\frac{\cosh(\zeta)}{\cosh(2\zeta)}\,e^{-n_1\zeta}\left(\frac{n_1e^{2z}+n_2}{n_1e^{2(z-\zeta)}+n_2}\right)^{n/2}d\zeta,$$

if $z$ denotes $\log s_1/s_2 = \log\hat\sigma_1/\hat\sigma_2$ and $n = n_1+n_2$.

6.6 P-what?! 

Section 5.6 embarks upon a historically interest- 
ing discussion on the warnings given by too good a 
p-value: if, for instance, a $\chi^2$ test leads to a value of
the $\chi^2$ statistic that is very small, this means (almost
certain) incompatibility with the $\chi^2$ assumption
just as well as too large a value. (Jeffreys recalls
the example of the data set of Mendel that was
modified by hand to agree with the Mendelian law
of inheritance, leading to too small a $\chi^2$ value.) This
can be seen as an indirect criticism of the standard 
tests (see also Section 8 below). 

7. CHAPTER VI: SIGNIFICANCE TESTS: 
VARIOUS COMPLICATIONS 

The best way of testing differences from a 
systematic rule is always to arrange our work so as 
to ask and answer one question at a time. 
H. Jeffreys, Theory of Probability, Section 6.1. 

This chapter appears as a marginalia of the pre- 
vious one in that it contains no major advance but 
rather a sequence of remarks, such as, for instance, 
an entry on time-series models (see Section 7.2 be- 
low). The very first paragraph of this chapter pro- 
duces a remarkably simple and intuitive justification 
of the incompatibility between improper priors and 
significance tests: the mere fact that we are seriously 
considering the possibility that it is zero may be as- 
sociated with a presumption that if it is not zero it 
is probably small. 



Then, Section 6.0 discusses the difficulty of set- 
tling for an informative prior distribution that takes 
into account the actual state of knowledge. By subdi- 
viding the sample into groups, different conclusions 
can obviously be reached, but this contradicts the 
Likelihood Principle that the whole data set must 
be used simultaneously. Of course, this could also 
be interpreted as a precursor attempt at defining 
pseudo-Bayes factors (Berger and Pericchi, 1996). 
Otherwise, as correctly pointed out by Jeffreys, the 
prior probability when each subsample is considered 
is not the original prior probability but the poste- 
rior probability left by the previous one, which is 
the basic implementation of the Bayesian learning 
principle. However, even with this correction, the 
final outcome of a sequential approach is not the 
proper Bayesian solution, unless posteriors are also 
used within the integrals of the Bayes factor. 

Section 6.5 also recapitulates both Chapters V 
and VI with general comments. It reiterates the warning,
already made earlier, that the Bayes factors obtained
via this noninformative approach are rarely immensely
in favor of $H_0$. This somewhat contradicts later studies,
like those of Berger and Sellke (1987) and Berger, Boukai
and Wang (1997), that the Bayes factor is generally less
prone to reject the null hypothesis. Jeffreys argues that,
when an alternative is actually used (. . .), the probability
that it is false is always of order $n^{-1/2}$, without further
justification. Note that this last section also includes the
seeds of model averaging: when a set of alternative
hypotheses (models $\mathfrak{M}_r$) is considered, the
predictive should be

$$p(x'|x) = \sum_r p_r(x'|x)\,\pi(\mathfrak{M}_r|x),$$

rather than conditional on the accepted hypothesis. 
Obviously, when K is large, [this] will give almost 
the same inference as the selected model/hypothesis. 
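
A toy computation may help to fix ideas about this averaged predictive. The setting below is entirely hypothetical and chosen by us for its closed forms: unit-variance normal observations, $\mathfrak{M}_0$: $\mu = 0$ against $\mathfrak{M}_1$: $\mu \sim \mathcal{N}(0,\tau^2)$, and equal prior weights on both models. The sketch returns the averaged predictive density at a new point together with the posterior probability of $\mathfrak{M}_0$.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def averaged_predictive(x_new, xbar, n, tau_sq=1.0):
    """Model-averaged predictive density at x_new and posterior weight of M0.

    Assumptions (ours): unit sampling variance, M0: mu = 0,
    M1: mu ~ N(0, tau_sq), equal prior probabilities for both models.
    """
    # Marginal likelihoods of the sufficient statistic xbar under each model
    m0 = normal_pdf(xbar, 0.0, 1.0 / n)
    m1 = normal_pdf(xbar, 0.0, tau_sq + 1.0 / n)
    w0 = m0 / (m0 + m1)
    # Posterior of mu under M1, then predictive of a new observation
    post_var = 1.0 / (n + 1.0 / tau_sq)
    post_mean = n * xbar * post_var
    p0 = normal_pdf(x_new, 0.0, 1.0)
    p1 = normal_pdf(x_new, post_mean, 1.0 + post_var)
    return w0 * p0 + (1.0 - w0) * p1, w0


if __name__ == "__main__":
    for xbar in (0.0, 0.2, 0.5):
        print(xbar, averaged_predictive(0.3, xbar, 100))
```

When the posterior weight of one model is close to one, the averaged predictive is numerically indistinguishable from the predictive of that model, in line with the remark above.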

7.1 Multiple Parameters 

Although it should proceed from first principles, 
the extension of Jeffreys's (second) rule for selection 
priors (see Section 6.4) to several parameters is dis- 
cussed in Sections 6.1 and 6.2 in a spirit similar to 
the reference priors of Berger and Bernardo (1992), 
by pointing out that, if two parameters $\alpha$ and $\beta$ are
introduced sequentially against the null hypothesis
$H_0: \alpha = \beta = 0$, testing first that $\alpha \ne 0$ then $\beta \ne 0$
conditional on $\alpha$ does not lead to the same joint
prior as the symmetric steps of testing first $\beta \ne 0$
then $\alpha \ne 0$ conditional on $\beta$. In fact,

$$d\arctan J_{\alpha}^{1/2}\;d\arctan J_{\beta|\alpha}^{1/2} \ne d\arctan J_{\beta}^{1/2}\;d\arctan J_{\alpha|\beta}^{1/2}.$$

Jeffreys then suggests using instead the marginalized
version

$$\pi(\alpha,\beta) = \frac{1}{\pi^2}\,\frac{dJ_{\alpha}^{1/2}}{1+J_{\alpha}}\,\frac{dJ_{\beta}^{1/2}}{1+J_{\beta}},$$



although he acknowledges that there are cases where 
the symmetry does not make sense (as, for instance, 
when parameters are not defined under the null, as, 
e.g., in a mixture setting). He then resorts to Ock- 
ham's razor (Section 6.12) to rank those unidimen- 
sional tests by stating that there is a best order of 
procedures, although there are cases where such an 
ordering is arbitrary or not even possible. Section 6.2 
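considers a two-dimensional parameter $(\lambda,\mu)$ and,
switching to polar coordinates, uses a (half-)Cauchy
prior on the radius $\rho = \sqrt{\lambda^2+\mu^2}$ (and a uniform
prior on the angle). The Bayes factor for testing the
nullity of the parameter $(\lambda,\mu)$ is then

$$K = \frac{\displaystyle\int \sigma^{-2n-1}\exp\left\{-\frac{1}{2\sigma^2}\big[2ns^2+n(\bar x^2+\bar y^2)\big]\right\}d\sigma}{\displaystyle\int \sigma^{-2n}\exp\left\{-\frac{1}{2\sigma^2}\big(2ns^2+n[(\bar x-\lambda)^2+(\bar y-\mu)^2]\big)\right\}\frac{d\lambda\,d\mu\,d\sigma}{\pi^2\rho(\sigma^2+\rho^2)}}$$

$$= \frac{2^{n-1}(n-1)!\,\{2ns^2+n(\bar x^2+\bar y^2)\}^{-n}}{\displaystyle\int \sigma^{-2n}\exp\left\{-\frac{n}{2\sigma^2}\big[2s^2+\rho^2-2\rho\hat\rho\cos\phi+\hat\rho^2\big]\right\}\frac{d\phi\,d\rho\,d\sigma}{\pi^2(\sigma^2+\rho^2)}},$$

where $\hat\rho^2 = \bar x^2+\bar y^2$ and which can only be integrated
up to

$$\frac{1}{K} = \frac{2}{\pi}\int_0^{\infty}\exp\left\{-\frac{ns^2v^2}{2s^2+\hat\rho^2}\right\}{}_1F_1\left(1-n,1,-\frac{n\hat\rho^2v^2}{2(2s^2+\hat\rho^2)}\right)\frac{dv}{1+v^2},$$

${}_1F_1$ denoting a confluent hypergeometric function.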
A similar analysis is conducted in Section 6.21 for 
a linear regression model associated with a pair of 
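harmonics ($x_t = \alpha\cos t + \beta\sin t + \varepsilon_t$), the only
difference being the inclusion of the covariate scales $A$
and $B$ within the prior,

$$\pi(\alpha,\beta|\sigma) = \frac{\sqrt{A^2+B^2}}{\pi^2\sqrt{2}}\;\frac{\sigma}{\sqrt{\alpha^2+\beta^2}\,\{\sigma^2+(A^2+B^2)(\alpha^2+\beta^2)/2\}}.$$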

7.2 Markovian Models 

While the title of Section 6.3 (Partial and serial
correlation) is slightly misleading, this section deals
with an AR(1) model,

$$x_{t+1} = \rho x_t + \tau\varepsilon_t.$$

It is not conclusive with respect to the selection of
the prior on $\rho$ given that Jeffreys does not consider
the null value $\rho = 0$ but rather $\rho = \pm 1$, which leads
to difficulties, if only because there is no stationary
distribution in that case. Since the Kullback divergence
is given by

$$J(\rho,\rho') = \frac{1+\rho\rho'}{(1-\rho^2)(1-\rho'^2)}\,(\rho'-\rho)^2,$$

Jeffreys's (testing) prior (against $H_0: \rho = 0$) should
be

$$\frac{1}{\pi}\,\frac{\{J^{1/2}(\rho,0)\}'}{1+J(\rho,0)} = \frac{1}{\pi}\,\frac{1}{\sqrt{1-\rho^2}},$$

which is also Jeffreys's regular (estimation) prior in 
that case. 

The (other) correlation problem of Section 6.4 also 
deals with a Markov structure, namely, that 



$$P(x_{t+1} = s \mid x_t = r) = \begin{cases}\alpha+(1-\alpha)p_r, & \text{if } s = r,\\ (1-\alpha)p_s, & \text{otherwise},\end{cases}$$

the null (independence) hypothesis corresponding
to $H_0: \alpha = 0$. Note that this parameterization of
the Markov model means that the $p_r$'s are the
stationary probabilities. The Kullback divergence being
particularly intractable,

$$J = \alpha\sum_r p_r(1-p_r)\log\left\{1+\frac{\alpha}{p_r(1-\alpha)}\right\},$$
Fig. 10. Jeffreys's prior of the coefficient $\alpha$ for the Markov
model of Section 6.4.

Jeffreys first produces the approximation

$$J \approx \frac{(m-1)\alpha^2}{1-\alpha}$$

that would lead to the (testing) prior

$$\frac{2}{\pi}\,\frac{1-\alpha/2}{\sqrt{1-\alpha}\,(1-\alpha+\alpha^2)}$$

[since the primitive of the above is $-\arctan(\sqrt{1-\alpha}/\alpha)$,
up to the factor $2/\pi$], but the possibility of negative^{18} $\alpha$
leads him to use instead a flat prior on the possible range of
$\alpha$'s. Note from Figure 10 that the above prior is quite
peaked in $\alpha = 1$.

8. CHAPTER VII: FREQUENCY DEFINITIONS 
AND DIRECT METHODS 

An hypothesis that may be true may be rejected 
because it has not predicted observable 
results that have not occurred. 
H. Jeffreys, Theory of Probability, Section 7.2. 

This short chapter opposes the classical approaches 
of the time (Fisher's fiducial and likelihood method- 
ologies, Pearson's and Neyman's p-values) to the 
Bayesian principles developed in the earlier chap- 
ters. (The very first part of the chapter is a digres- 
sion on the "frequentist" theories of probability that 
is not particularly relevant from a mathematical per- 
spective and that we have already addressed earlier. 



^{18}Because of the very specific (unidimensional)
parameterization of the Markov chain, using a negative $\alpha$
indeed makes sense.


See, however, Dawid, 2004, for a general synthesis 
on this point.) The fact that Student's and Fisher's 
analyses of the t statistic coincide with Jeffreys's is 
seen as an argument in favor both of the Bayesian 
approach and of the choice of the reference prior 
$\pi(\mu,\sigma) \propto 1/\sigma$.

The most famous part of the chapter (Section 7.2) 
contains the often-quoted sentence above, which applies
to the criticism of p-values, since a decision to
reject the null hypothesis is based on the observed
p-value being in the upper tail of its distribution
under the null, even though nothing but the observed
value is relevant. Given that the p-value is a
one-to-one transform of the original test statistic, the
criticism is maybe less virulent than it appears:
Jeffreys still refers to twice the standard error as a
criterion for possible genuineness and three times the
standard error for definite acceptance. The major
criticism that this quantity does not account for the 
alternative hypothesis (as argued, for instance, in 
Berger and Wolpert, 1988) does not appear at this 
stage, but only later in Section 7.22. As perceived 
in Theory of Probability, the problem with Pear- 
son's and Fisher's approaches is therefore rather the 
use of a convenient bound on the test statistic as 
two standard deviations (or on the p-value as 0.05).
There is, however, an interesting remark that the 
choice of the hypothesis should eventually be aimed 
at selecting the best inference, even though Jeffreys 
concludes that there is no way of stating this suffi- 
ciently precisely to be of any use. Again, expressing 
this objective in decision-theoretic terms seems the 
most natural solution today. Interestingly, the fol- 
lowing sentence in Section 7.51 could be interpreted, 
once again in an apocryphal way, as a precursor to 
decision theory: There are cases where there is no 
positive new parameter, but important consequences 
might follow if it was not zero, leading to loss func- 
tions mixing estimation and testing as in Robert and 
Casella (1994). 

In Section 7.5 we find a similarly interesting rein- 
terpretation of the classical first and second type 
errors, computing an integrated error based on the 
0-1 loss (even though it is not defined this way) as 



$$\int_0^{a_c} f_1(x)\,dx + \int_{a_c}^{\infty} f_0(x)\,dx,$$

where $x$ is the test statistic, $f_0$ and $f_1$ are the
marginals under the null and under the alternative,
respectively, and $a_c$ is the bound for accepting $H_0$. The
optimal value of $a_c$ is therefore given by $f_0(a_c) =
f_1(a_c)$, which amounts to

$$\pi(H_0|x = a_c) = \pi(H_1|x = a_c),$$

that is, K = 1 if both hypotheses are equally weighted 
a priori. This is a completely rigorous derivation 
of the optimal Bayesian decision for testing, even 
though Jeffreys does not approach it this way, in 
particular, because the prior probabilities are not 
necessarily equal (a point discussed earlier in
Section 6.0 for instance). It is nonetheless a fairly
convincing argument against p-values in terms of
smallest number of mistakes. More prosaically, Jeffreys
briefly discusses in this section the disturbing asym- 
metry of frequentist tests, when both hypotheses are 
of the same type: if we must choose between two def- 
initely stated alternatives, we should naturally take 
the one that gives the larger likelihood, even though 
each may be within the range of acceptance of the 
other. 

9. CHAPTER VIII: GENERAL QUESTIONS 

A prior probability used to express ignorance is 
merely the formal statement of that ignorance. 

H. Jeffreys, Theory of Probability, Section 8.1. 

This concluding chapter summarizes the main rea- 
sons for using the Bayesian perspective: 

1. Prior and sampling probabilities are representations
of degrees of belief rather than frequencies
(Section 8.0). Once again, we believe that this
debate^{19} is settled today, by considering that
probability distributions and improper priors are defined
according to the rules of measure theory; see,
however, Dawid (2004) for another perspective oriented
toward calibration.

2. While prior probabilities are subjective and can- 
not be uniquely assessed, Theory of Probability sets 
a general (objective) principle for the derivation of 
prior distributions (Section 8.1). It is quite inter- 
esting to read Jeffreys's defence of this point when 
taking into account the fact that this book was set- 
ting the point of reference for constructing nonin- 
formative priors. Theory of Probability does little, 
however, toward the construction of informative pri- 
ors by integrating existing prior information (except 



^{19}Jeffreys argues that the limit definition was not stated till
eighty years later than Bayes, which sounds incorrect when
considering that the Law of Large Numbers was produced by
Bernoulli in Ars Conjectandi.



in the sequential case discussed earlier), recogniz- 
ing nonetheless the natural discrepancy between two 
probability distributions conditional on two differ- 
ent data sets. More fundamentally, this stresses that 
Theory of Probability focuses on prior probabilities 
used to express ignorance more than anything else. 

3. Bayesian statistics naturally allow for model 
specification and, as such, do not suffer (as much) 
from the neglect of an unforeseen alternative (Sec- 
tion 8.2). This is obviously true only to some extent: 
if, in the process of comparing models $\mathfrak{M}_i$ based on
an experiment, one very likely model is omitted from
the list, the consequences may be severe. On the
other hand, and in relation to the previous discussion
on the p-values, the Bayesian approach allows
for alternative models and is thus naturally embedding
model specification within its paradigm.^{20} The
fact that it requires an alternative hypothesis to op- 
erate a test is an illustration of this feature. 

4. Different theories leading to the same poste- 
riors cannot be distinguished since questions that 
cannot be decided by means of observations are best 
left alone (Section 8.3). The physicists'^{21} concept
of rejection of unobservables is to be understood as
the elimination of parameters in a law that make no
contribution to the results of any observation or as a
version of Ockham's principle, introducing new
parameters only when observations showed them to be
necessary (Section 8.4). See Dawid (1984, 2004) for
a discussion of this principle he calls Jeffreys's Law.

5. The theory of Bayesian statistics as presented 
in Theory of Probability is consistent in that it pro- 
vides general rules to construct noninformative pri- 
ors and to conduct tests of hypotheses (Section 8.6). 
It is in agreement with the Likelihood Principle and 
with conditioning on sufficient statistics. It also 
avoids the use of p-values for testing hypotheses by
requiring no empirical hypothesis to be true or false 



^{20}The point about being prepared for occasional wrong
decisions could possibly be related to Popper's notion of
sifiability: by picking a specific prior, it is always possible to 
modify inference toward one's goal. Of course, the divergences 
between Jeffreys's and Popper's approaches to induction make 
them quite irreconcilable. See Dawid (2004) for a Bayes-de 
Finetti-Popper synthesis. 

^{21}Both Sections 8.3 and 8.4 seem only concerned
with a physicists' debate, particularly about the
relevance of quantum theory.

^{22}We recall that Fisher information is not fully compatible
with the Likelihood Principle (Berger and Wolpert, 1988).
a priori. However, special cases and multidimensional
settings show that this theory cannot claim
to be completely universal. 

6. The final paragraph of Theory of Probability 
states that the present theory does not justify induc- 
tion; what it does is to provide rules for consistency. 
This is absolutely coherent with the above: although 
the book considers many special cases and excep- 
tions, it does provide a general rule for conduct- 
ing point inference (estimation) and testing of hy- 
potheses by deriving generic rules for the construc- 
tion of noninformative priors. Many other solutions 
are available, but the consistency cannot be denied, 
while a ranking of those solutions is unthinkable. 
In essence, Theory of Probability has thus mostly
achieved its goal of presenting a self-contained the- 
ory of inference based on a minimum of assumptions 
and covering the whole field of inferential purposes. 

10. CONCLUSION 

It is essential to the possibility of induction that we 
shall be prepared for occasional wrong decisions. 
H. Jeffreys, Theory of Probability, Section 8.2. 

Despite a tone that some may consider as overly 
critical, and therefore unfair to such a pioneer in our 
field, this perusal of Theory of Probability leaves us 
with the feeling of a considerable achievement to- 
ward the formalization of Bayesian theory and the 
construction of an objective and consistent framework.
Besides setting the Bayesian principle in full
generality,

Posterior Probability $\propto$ Prior Probability $\cdot$ Likelihood,

including using improper priors indistinctly from proper
priors, the book sets a generic theory for selecting
reference priors in general inferential settings,

$$\pi(\theta) \propto |I(\theta)|^{1/2},$$

as well as when testing point null hypotheses,

$$\frac{1}{\pi}\,\frac{dJ^{1/2}}{1+J} = \frac{1}{\pi}\,d\{\tan^{-1}J^{1/2}(\theta)\},$$

when $J(\theta) = \mathrm{div}\{f(\cdot|\theta_0), f(\cdot|\theta)\}$ is a divergence
measure between the sampling distribution under the
null and under the alternative. The lack of a decision- 
theoretic formalism for point estimation notwith- 
standing, Jeffreys sets up a completely operational 
technology for hypothesis testing and model choice 



that is centered on the Bayes factor. Premises of hi- 
erarchical Bayesian analysis, reference priors, match- 
ing priors and mixture analysis can be found at var- 
ious places in the book. That it sometimes lacks 
mathematical rigor and often indulges in debates 
that may look superficial today is once again a re- 
flection of the idiosyncrasies of the time: even the 
ultimate revolutions cannot be built on void and 
they do need the shoulders of earlier giants to step 
further. We thus absolutely acknowledge the depth 
and worth of Theory of Probability as a foundational 
text for Bayesian Statistics and hope that the cur- 
rent review may help in its reassessment. 